Tag: search

SQL Server word-level bigram function
I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full text searches that people perform). Over the years I’ve been involved in various aspects of our search; from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators for example) to working on our recommendation algorithm, and implementing search auto-suggest.

Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.

Did you mean…?

The way we currently do it is based on taking the individual words entered, and generating a large number of alternative spellings for each of them, and then comparing those lists of words against a “known-good” list of search terms and returning the most likely one for each word entered (according to a magic score based on frequency, average results per search, etc). This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word level n-grams (unigrams, bigrams, and trigrams specifically).

Basically, this involves taking each search string and splitting it into individual words to give you your unigrams, or then pairing them up (in the case of bigrams).

For example, the string “pink ipod nano” would be returned as:

Word 1 Word 2
# pink
pink ipod
ipod nano
nano #

The hash symbols are delimiters and are sometimes not used, in which case the 1st and last rows of this table wouldn’t be returned. In my function below, I use a less common symbol (the “Eszett”) as my delimiter, and have a parameter to allow it to be omitted if necessary.

What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type in the case of a misspelling.

For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters). Or, if the same user types in “ipod nqno“, then we should be able to work out that firstly “nqno” isn’t a common word according to our list of unigrams, and secondly that the word “ipod” followed by a space and “n” is most likely meant to be “ipod nano” – and we can then recommend that as an alternative search.

Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂

The guts

It’s relatively self explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL function allows me to test my theory myself before asking a developer to spend any time on it.
```
-- =============================================
-- Author:		David Curlewis
-- Create date: 2011-09-08
-- Description:	Splits a string and returns word-level bigrams (i.e. all pairs of words making up the string)
-- =============================================
CREATE FUNCTION dbo.fn_word_bigrams (
    @string             NVARCHAR(1000),
    @include_delimiters BIT = 1 -- whether to include rows for prefix and suffix delimiters
)
RETURNS @bigrams TABLE (
    row_id  INT, 
    word1   NVARCHAR(100),
    word2   NVARCHAR(100) 
)
AS
BEGIN
    DECLARE @rowcount   INT,
            @i          INT;

    DECLARE @tt_words   TABLE (row_id INT, word NVARCHAR(100))
    DECLARE @tt_bigrams TABLE (word1 NVARCHAR(100), word2 NVARCHAR(100))

    SET @i = 1;
    SET @string = ltrim(rtrim(@string)) 
        + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;

    WITH split_words AS
    (
        SELECT  ROW_NUMBER() OVER(ORDER BY @@SPID) AS row_id,
                item AS word
        FROM    dbo.SplitString_Multi(@string, ' ')
    )
    , bigrams AS 
    (
        SELECT  row_id,
                CAST('ß' AS NVARCHAR(4000)) AS word1, -- ß is our delimiter
                CAST(word AS NVARCHAR(4000)) AS word2
        FROM    split_words
        WHERE   row_id = 1
        UNION ALL
        SELECT  b.row_id + 1 AS row_id,
                CAST(b.word2 AS NVARCHAR(4000)) AS word1, 
                CAST(w.word AS NVARCHAR(4000)) AS word2
        FROM    split_words AS w
        JOIN    bigrams AS b ON w.row_id = b.row_id + 1
    )
    INSERT  @bigrams
    SELECT  row_id,
            word1,
            word2
    FROM    bigrams
    WHERE   row_id &amp;gt; CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;
    
    RETURN; 
END
```
Shout if you have any questions or comments.

Cheers,

DB Dave
2011-09-08

Full-Text Search Operators – Part 4: Wrapping it up

"iPod nano 4Gb wrapping" by Darkmere — That’s a wrap!

Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators
Part 3: Splitting the search string

Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two. I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T/SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

NEAR enough…

“NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINS or CONTAINSTABLE.

In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them. Where NEAR is different to AND, is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.

-------------------------------------------------------------------------------------------------
-- "NEAR" OPERATOR HANDLER
-------------------------------------------------------------------------------------------------
IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
BEGIN
-- reinitialise
SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;
-- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
SELECT rowid
FROM @tt_searchwords
WHERE REPLACE(operator, '`', '') = 'op_near';

OPEN near_string;
FETCH NEXT FROM near_string INTO @i;
WHILE @@FETCH_STATUS = 0
BEGIN
-- get the rowid of the value before and after the NEAR operator row
SELECT @o = MAX(rowid)
FROM @tt_searchwords
WHERE rowid &lt; @i
AND operator NOT LIKE 'paren%';

SELECT @oo = MIN(rowid)
FROM @tt_searchwords
WHERE rowid > @i
AND operator NOT LIKE 'paren%';

-- if the NEAR-operator (~) is the first or last word in the search string, its not a valid option
IF (@o IS NULL) OR (@oo IS NULL)
DELETE @tt_searchwords
WHERE rowid = @i;
ELSE BEGIN
-- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
FROM @tt_searchwords
WHERE rowid = @o;

SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
FROM @tt_searchwords
WHERE rowid = @oo;

-- update the NEAR operator record with the concatenated search term
UPDATE @tt_searchwords
SET word = @current_word,
operator = 'searchnear'
WHERE rowid = @i;

-- delete the records before and after the NEAR operator
UPDATE @tt_searchwords
SET operator = 'searchword_near'
WHERE rowid IN (@o, @oo);
END

FETCH NEXT FROM near_string INTO @i;
END

CLOSE near_string;
DEALLOCATE near_string;

-- delete search words that are now included IN "NEAR phrases"
DELETE @tt_searchwords
WHERE operator = 'searchword_near';
END

It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

Builderer of awesome

This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together. It’s not pretty, but it works.

-------------------------------------------------------------------------------------------------
-- Final command string builderer of awesome
-------------------------------------------------------------------------------------------------
-- Need to quote any search terms that contain certain non-alphanumeric characters
UPDATE @tt_searchwords
SET word = CASE
WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
ELSE word END
WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted
-- reinitialise
SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';

-- get the range of row IDs that represent valid search terms and operators
SELECT @i = MIN(rowid),
@ii = MAX(rowid)
FROM @tt_searchwords
WHERE operator LIKE 'search%'
OR operator LIKE 'paren%';

WHILE @i &lt;= @ii
BEGIN
SELECT @final_search_temp =
CASE REPLACE(operator, '`', '')
WHEN 'op_not' THEN ' AND NOT '
WHEN 'op_or' THEN ' OR '
WHEN 'paren_o' THEN ' AND ('
WHEN 'paren_x' THEN ')'
ELSE ' AND '
END
+ CASE REPLACE(operator, '`', '')
WHEN 'op_not' THEN '' -- NOT
WHEN 'op_or' THEN '' -- OR
WHEN 'paren_o' THEN '' -- (
WHEN 'paren_x' THEN '' -- )
WHEN 'searchnear' THEN word -- NEAR
WHEN 'searchphrase' THEN word -- "search phrase"
WHEN 'searchexplicit' THEN word -- explicit search
WHEN 'searchprefix' THEN word -- wildcard
ELSE REPLACE(@word_syntax, '&lt;PLACEHOLDER/>', word)-- AND (default)
END
FROM @tt_searchwords
WHERE rowid = @i;

-- CONTAINSTABLE only accepts up to 4000 characters
IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
BREAK
ELSE
SET @final_search = @final_search + @final_search_temp;

IF @i = @ii BREAK;

SELECT @i = MIN(rowid)
FROM @tt_searchwords
WHERE rowid > @i;
END

-- sort out issues of operator double-ups, etc
-- clunky but it works
SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
SET @final_search = '(' + @final_search + ')';
SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
SET @final_search = REPLACE(@final_search, '( AND ', '(');
SET @final_search = REPLACE(@final_search, ' AND )', ')');
SET @final_search = REPLACE(@final_search, '( OR ', '(');
SET @final_search = REPLACE(@final_search, ' OR )', ')');
SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);

That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

Cheers
DB Dave

2011-07-26

Full-Text Search Operators – Part 3: Splitting the search string

Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators

We’ve now got to the point where we have a “clean” search string. The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.

There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons). Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.

Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTE’s to further refine the data.

-------------------------------------------------------------------------------------------------
-- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS
-------------------------------------------------------------------------------------------------
;WITH
banana_split AS
(
SELECT id AS rowid,
LTRIM(RTRIM(data)) as word
FROM [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators
WHERE LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-)
),

This then feeds into a 2nd CTE which classifies each individual search term:

words_classified AS
(
-- ********************************************************************************************
-- The order of the CASE clauses below is important in deciding the order of preference
-- given to the different special characters. For example, if a string contains both an
-- asterisk and double quotes (e.g. "search*term") then the operator that appears first in
-- the CASE statement below will take affect (e.g. either "search term" or search*).
-- ********************************************************************************************
SELECT rowid,
word,
CASE
WHEN word = '(' THEN 'paren_o' -- Opening parenthesis
WHEN word = ')' THEN 'paren_x' -- Closing parenthesis
WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote
WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote
WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah")
WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah')
WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah)
-- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators.
WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator
WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator
WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator
WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator
WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&)
WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|)
WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~)
WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-)
WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*)
ELSE 'searchword' END AS operator
FROM banana_split
),

I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or you database already uses a case-sensitive collation).

Now another CTE tidies up by removing some superfluous characters, etc.:

-------------------------------------------------------------------------------------------------
-- NOW TIDY UP THE WORDS AND OPERATORS
-------------------------------------------------------------------------------------------------
tidy_up_round1 AS
(
-- tidy up some of the words
SELECT rowid,
REPLACE(
CASE
WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '') -- remove +'s FROM explicit searches
WHEN operator = 'search_dq_o' THEN RIGHT(word, LEN(word)-1) -- remove the opening quote
WHEN operator = 'search_dq_x' THEN LEFT(word, LEN(word)-1) -- remove the closing quote
ELSE word END,
'"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid)
operator
FROM words_classified
WHERE operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway
AND (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless its a parenthesis)
),
tidy_up_round2 AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
word,
operator
FROM tidy_up_round1
WHERE rowid BETWEEN -- exclude any operators that are incorrectly located at the beginnning or end of the string.
(SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
),

Next, remove consecutive & duplicate operators…

-------------------------------------------------------------------------------------------------
-- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID).
-------------------------------------------------------------------------------------------------
de_duped AS
(
SELECT a.rowid,
a.word,
a.operator
FROM tidy_up_round2 a
LEFT JOIN tidy_up_round2 b
ON a.rowid = (b.rowid + 1) -- consecutive rowid's
AND a.operator LIKE 'op%' -- only operators
AND b.operator LIKE 'op%' -- only operators
WHERE b.rowid IS NULL
),

…and exclude any empty or unnecessary parentheses…

-------------------------------------------------------------------------------------------------
-- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)")
-------------------------------------------------------------------------------------------------
parentheses_groups AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS prowid,
rowid,
word,
operator
FROM de_duped
WHERE operator IN ('paren_o', 'paren_x')
OR operator LIKE 'search%'
),
empty_pairs AS
(
SELECT a.rowid AS rowid_o,
b.rowid AS rowid_x
FROM parentheses_groups a
JOIN parentheses_groups b
ON a.operator = 'paren_o' -- opening parenthesis
AND b.operator = 'paren_x' -- closing parenthesis
AND (
a.prowid = (b.prowid + 1) -- where opening and closing are consecutive rows (i.e. empty)
OR a.prowid = (b.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary)
)
),
-------------------------------------------------------------------------------------------------
-- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET
-------------------------------------------------------------------------------------------------
no_empties AS
(
SELECT a.rowid,
a.word,
a.operator
FROM de_duped a
LEFT JOIN empty_pairs x
ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x)
WHERE x.rowid_o IS NULL
),

And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:

-------------------------------------------------------------------------------------------------
-- WILDCARD HANDLER
-------------------------------------------------------------------------------------------------
wildcard_handler AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
CASE operator
WHEN 'searchprefix' THEN
'"' + CASE
WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word)-1)
ELSE REPLACE(word, '*', '')
END + '*"'
ELSE word
END AS word,
operator
FROM no_empties
)
-------------------------------------------------------------------------------------------------
-- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE
-------------------------------------------------------------------------------------------------
INSERT @tt_searchwords (rowid, word, operator)
SELECT rowid, word, operator
FROM wildcard_handler;

Phew. Enough for this post I think.

So, at this point if I pass in the following search string – sql NEAR server -(oracle | mysql) – then my table variable should be populated with the following:

In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.

See ya!

DB Dave

2011-06-10

Full-Text Search Operators – Part 2: Replacing alternate operators
In this post I’m going to deal with another piece of the puzzle that I described in part 1. Now, after re-reading that post it became obvious that I probably need to outline what the end result is meant to look like:

We want to be able to pass in a search string, and get back a nicely formatted full-text search clause which makes full use of Boolean operators, respects parentheses, and supports wildcards. Easy, right? Yup, and we need to do it all using only T/SQL.

Here’s the steps that make up the final function (in a nutshell):
1. Remove some dodgy characters (vertical-tab, form-feed, etc)
2. Tidy up any parentheses [part 1]
3. Replace any alternate operators with place-holders (most operators have 2 representations; e.g. OR and |, AND and &, NOT and –, etc).
4. Break the search term into individual words (i.e. split the string on space characters)
5. Using this list of individual words, do the following:
  1. Classify each word as an operator, search term, parenthesis, wildcard, etc.
  2. Tidy up the list, removing illegal and unnecessary characters
  3. Remove invalid operators (e.g. “ipod NOT” is not a valid search term)
  4. Remove unnecessary parentheses (e.g. single words wrapped in parentheses)
  5. Handle wildcards. Since full-text only accepts prefix searches, change all invalid wildcard searches into prefix searches.
6. Check for single and double-quotes (& tidy up any invalid quote-marks, make sure only valid pairs remain, etc.).
7. Process any NEAR(~) operators.
8. Put the final Full-Text command string together.
Right, so we’ve already taken care of number 2 in the previous post, so now on to number 3 – replacing the alternate operators. This step may not be necessary for you, since it’s sole purpose is to preserve the original search string for our reporting purposes.

You see, if I search for “ipad OR iphone NOT ipod” and you search for “ipad | iphone –ipod”, then obviously since they’re really the same search we will get the same results (and our cache should also cache them as 1 search), but for reporting purposes we want to know that our searches were typed in differently. The way we do that in our implementation of this function, is that we return a normalised search string used to perform & cache the search, and an original string used for reporting.

Here’s the very straight forward code:
```
IF OBJECT_ID('[dbo].[fn_search_clause_operator_replacer]') IS NULL
EXEC('CREATE FUNCTION [dbo].[fn_search_clause_operator_replacer] () RETURNS INT AS BEGIN RETURN(0); END ');
GO
/***************************************************************************************************************************
This is a supporting function, which is called by the function [fn_search_clause_get].
It accepts a string input, and outputs the same string, but with any character operators replaced with
placeholder text-operators instead.
Author : David Curlewis
Date : 07/2009

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
http://creativecommons.org/licenses/by-nc/3.0/
****************************************************************************************************************************/
ALTER FUNCTION [dbo].[fn_search_clause_operator_replacer] (
@search_text NVARCHAR(4000)
) RETURNS NVARCHAR(4000)
AS
BEGIN
DECLARE @not_pattern VARCHAR(50),
@not_pos INT;

SET @search_text = REPLACE(REPLACE(REPLACE(@search_text,
'|', ' `OR` '), -- OR
'~', ' `NEAR` '), -- NEAR
'&', ' `AND` '); -- AND

-- the NOT operator requires some extra work because we only consider it an operator if it is not hyphenating a word
SET @not_pattern = '%[^a-z0-9][-][ a-z0-9"''(]%'; -- valid NOT operator pattern
SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get the positions of any valid "-" operators
WHILE (@not_pos > 0)
BEGIN
SET @search_text = STUFF(@search_text, @not_pos, 2, ' `NOT` ');
SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get pos of next operators (if any)
END

RETURN(@search_text);
END
GO
```
The first bit is very simple; it’s just a REPLACE statement replacing each operator’s “symbol operator” (i.e. |, ~, &) with a bit of placeholder text (i.e. `OR`, `NEAR`, `AND`).
The NOT operators require a little more work because we don’t want to interpret all hyphens as operators, otherwise you wouldn’t be able to search for hyphenated words. Using PATINDEX we can very easily match on a simple RegEx-like pattern: [^a-z0-9][-][ a-z0-9″”(]

This looks for any occurrences of the hyphen, where it follows a non-alphanumeric character, and precedes a space, an alphanumeric character, a double or single quote, or an opening parenthesis. It then replaces each occurrence with the placeholder text `NOT`.

In the next post I’ll go through the meatiest bit, which involves splitting the string (number 4), and preparing the resulting data (number 5).

Cheers

DB Dave
2011-06-01

Full-Text Search Operators – Part 1: Cleaning up parentheses in a string

Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators
Part 3: Splitting the search string
Part 4: Wrapping it up

A few years ago I wanted to have a crack at improving one of our websites’ search by providing “search operator” functionality. These are basically special words or characters (i.e. operators) which provide Boolean search capability; e.g. AND, OR, NOT.

I looked around but couldn’t find anything implemented using T/SQL. There were a few .Net solutions, but I’m not a developer, and can only handle very basic .Net on a good day. So I had a crack at writing my own. I started off with the Google cheat-sheet as my initial list of requirements, which was basically as follows:

Must support the operators AND, OR, NOT, and NEAR
Must allow for “exact-phrase” searches
Should respect parentheses (for precedence control)
Should be able to perform wildcard searches
Operators must be case-sensitive (i.e. OR != or)

To attack this problem, I decided to split it into multiple chunks, and write separate blocks of code to meet each requirements. In this post I’m going to introduce the first piece of the puzzle (which is number 3 from the list above; handling parentheses). What this means is that given a string containing parentheses, I need to output the same string with valid parentheses intact (and obviously invalid ones removed).

Below is the function. I won’t explain it line by line since its pretty well commented, and relatively simple anyway.

IF OBJECT_ID('[dbo].[fn_search_clause_parentheses_handler]') IS NULL
EXEC('CREATE FUNCTION [dbo].[fn_search_clause_parentheses_handler] () RETURNS INT AS BEGIN RETURN(0); END ');
GO
/***************************************************************************************************************************
This is a supporting function, which is called by the function [fn_search_clause_get].
It accepts a string input, and outputs the same string, but with any invalid parentheses removed, and all remaining
valid parentheses buffered (by a space on each side) so that they are split correctly by the calling function.
Author : David Curlewis
Date : 07/2009

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
http://creativecommons.org/licenses/by-nc/3.0/
****************************************************************************************************************************/
ALTER FUNCTION [dbo].[fn_search_clause_parentheses_handler] (
@search_text NVARCHAR(4000)
) RETURNS NVARCHAR(4000)
AS
BEGIN
-------------------------------------------------------------------------------------------------
-- PARENTHESES HANDLER
-------------------------------------------------------------------------------------------------
-- IF parentheses exist in the string
IF (PATINDEX('%[()]%', @search_text) > 0)
BEGIN
DECLARE @pos_o INT, @pos_x INT, @charpos INT;
DECLARE @tt_char_split TABLE (id SMALLINT IDENTITY(1,1), [char] VARCHAR(2) NOT NULL);
SET @charpos = 1;

-- split the string apart into a temp table
WHILE @charpos <= LEN(@search_text)
BEGIN
INSERT @tt_char_split ([char])
SELECT SUBSTRING(@search_text, @charpos, 1);

SET @charpos = @charpos + 1;
END

-- while we have opening and closing parentheses
WHILE EXISTS(SELECT TOP 1 1 FROM @tt_char_split WHERE [char] = '(')
AND EXISTS(SELECT TOP 1 1 FROM @tt_char_split WHERE [char] = ')')
BEGIN
-- Get the position of the first closing parenthesis
SET @pos_x = ( SELECT MIN(id)
FROM @tt_char_split
WHERE [char] = ')');

-- Get the position of the first opening parenthesis
SET @pos_o = ( SELECT MAX(id)
FROM @tt_char_split
WHERE [char] = '('
AND id < @pos_x);

-- there is a valid pair of parentheses
IF (@pos_o IS NOT NULL AND @pos_x IS NOT NULL)
BEGIN
-- Escape this pair so we know they've been processed
UPDATE @tt_char_split
SET [char] = '(('
WHERE id = @pos_o;

UPDATE @tt_char_split
SET [char] = '))'
WHERE id = @pos_x;
END
ELSE BEGIN -- there is not a valid pair of parentheses
UPDATE @tt_char_split
SET [char] = ''
WHERE id IN (@pos_o, @pos_x);
END
END

-- remove any remaining (invalid) parentheses
UPDATE @tt_char_split
SET [char] = ''
WHERE [char] IN ('(', ')');

SET @search_text = '';

-- build new search string
SELECT @search_text = @search_text + REPLACE(REPLACE([char], '((', ' ( '), '))', ' ) ')
FROM @tt_char_split
ORDER BY id;

-- remove empty pairs, i.e. " ( ) "
WHILE CHARINDEX(' ( ) ', @search_text) > 0
SET @search_text = REPLACE(@search_text, ' ( ) ', '');
END

RETURN(@search_text);
END
GO

So, its a scalar function which accepts a string, and returns that same string, but with it’s parentheses tidied up and “buffered” (i.e. I wrap them in spaces to make it easier for the calling function to then split the string into its component parts).

Now, as I sit here watching the latest straight-to-dvd masterpiece that is “Barbie: A fashion fairy-tale” (hey, its a Sunday afternoon and I have 3 daughters – you’d lose the battle too!), I’m having a hard time getting the code to post properly, so if you have any issues, flick me a note in the comments below and I’ll sort something else out.

Give it a try and let me know if you find any bugs (or make any improvements!). I’ll follow up soon with posts describing the rest of the process, but for now I’ve had enough of Barbie, and think I deserve a whisky.

Cheers

DB Dave

2011-05-29

Word 1	Word 2
#	pink
pink	ipod
ipod	nano
nano	#

Tag: search

SQL Server word-level bigram function

Did you mean…?

The guts

Full-Text Search Operators – Part 4: Wrapping it up

NEAR enough…

Builderer of awesome

Full-Text Search Operators – Part 3: Splitting the search string

Full-Text Search Operators – Part 2: Replacing alternate operators

Full-Text Search Operators – Part 1: Cleaning up parentheses in a string