Tag: search

  • SQL Server word-level bigram function

    I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full text searches that people perform).  Over the years I’ve been involved in various aspects of our search; from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators for example) to working on our recommendation algorithm, and implementing search auto-suggest.

    Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.

    Did you mean…?

    The way we currently do it is based on taking the individual words entered, and generating a large number of alternative spellings for each of them, and then comparing those lists of words against a “known-good” list of search terms and returning the most likely one for each word entered (according to a magic score based on frequency, average results per search, etc).  This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word level n-grams (unigrams, bigrams, and trigrams specifically).

    Basically, this involves taking each search string and splitting it into individual words to give you your unigrams, or then pairing them up (in the case of bigrams).

    For example, the string “pink ipod nano” would be returned as:

    Word 1Word 2
    #pink
    pinkipod
    ipodnano
    nano#

    The hash symbols are delimiters and are sometimes not used, in which case the 1st and last rows of this table wouldn’t be returned. In my function below, I use a less common symbol (the “Eszett”) as my delimiter, and have a parameter to allow it to be omitted if necessary.

    What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type in the case of a misspelling.

    For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters).  Or, if the same user types in “ipod nqno“, then we should be able to work out that firstly “nqno” isn’t a common word according to our list of unigrams, and secondly that the word “ipod” followed by a space and “n” is most likely meant to be “ipod nano” – and we can then recommend that as an alternative search.

    Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂

    The guts

    It’s relatively self explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL function allows me to test my theory myself before asking a developer to spend any time on it.

    -- =============================================
    -- Author:		David Curlewis
    -- Create date: 2011-09-08
    -- Description:	Splits a string and returns word-level bigrams (i.e. all pairs of words making up the string)
    -- =============================================
    CREATE FUNCTION dbo.fn_word_bigrams (
        @string             NVARCHAR(1000),
        @include_delimiters BIT = 1 -- whether to include rows for prefix and suffix delimiters
    )
    RETURNS @bigrams TABLE (
        row_id  INT, 
        word1   NVARCHAR(100),
        word2   NVARCHAR(100) 
    )
    AS
    BEGIN
        DECLARE @rowcount   INT,
                @i          INT;
    
        DECLARE @tt_words   TABLE (row_id INT, word NVARCHAR(100))
        DECLARE @tt_bigrams TABLE (word1 NVARCHAR(100), word2 NVARCHAR(100))
    
        SET @i = 1;
        SET @string = ltrim(rtrim(@string)) 
            + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;
    
        WITH split_words AS
        (
            SELECT  ROW_NUMBER() OVER(ORDER BY @@SPID) AS row_id,
                    item AS word
            FROM    dbo.SplitString_Multi(@string, ' ')
        )
        , bigrams AS 
        (
            SELECT  row_id,
                    CAST('ß' AS NVARCHAR(4000)) AS word1, -- ß is our delimiter
                    CAST(word AS NVARCHAR(4000)) AS word2
            FROM    split_words
            WHERE   row_id = 1
            UNION ALL
            SELECT  b.row_id + 1 AS row_id,
                    CAST(b.word2 AS NVARCHAR(4000)) AS word1, 
                    CAST(w.word AS NVARCHAR(4000)) AS word2
            FROM    split_words AS w
            JOIN    bigrams AS b ON w.row_id = b.row_id + 1
        )
        INSERT  @bigrams
        SELECT  row_id,
                word1,
                word2
        FROM    bigrams
        WHERE   row_id > CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;
        
        RETURN; 
    END

    Shout if you have any questions or comments.

    Cheers,

    DB Dave

  • Full-Text Search Operators – Part 4: Wrapping it up

    "iPod nano 4Gb wrapping" by Darkmere
    That’s a wrap!

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string

    Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two.  I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

    Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T/SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

    NEAR enough…

    “NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINS or CONTAINSTABLE.

    In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them.  Where NEAR is different to AND, is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
    So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.

    -------------------------------------------------------------------------------------------------
    -- "NEAR" OPERATOR HANDLER
    -------------------------------------------------------------------------------------------------
    IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
    BEGIN
    -- reinitialise
    SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;
    -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
    DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
    SELECT rowid
    FROM @tt_searchwords
    WHERE REPLACE(operator, '`', '') = 'op_near';
    
    OPEN near_string;
    FETCH NEXT FROM near_string INTO @i;
    WHILE @@FETCH_STATUS = 0
    BEGIN
    -- get the rowid of the value before and after the NEAR operator row
    SELECT @o = MAX(rowid)
    FROM @tt_searchwords
    WHERE rowid < @i
    AND operator NOT LIKE 'paren%';
    
    SELECT @oo = MIN(rowid)
    FROM @tt_searchwords
    WHERE rowid > @i
    AND operator NOT LIKE 'paren%';
    
    -- if the NEAR-operator (~) is the first or last word in the search string, its not a valid option
    IF (@o IS NULL) OR (@oo IS NULL)
    DELETE @tt_searchwords
    WHERE rowid = @i;
    ELSE BEGIN
    -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
    SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
    FROM @tt_searchwords
    WHERE rowid = @o;
    
    SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
    FROM @tt_searchwords
    WHERE rowid = @oo;
    
    -- update the NEAR operator record with the concatenated search term
    UPDATE @tt_searchwords
    SET word = @current_word,
    operator = 'searchnear'
    WHERE rowid = @i;
    
    -- delete the records before and after the NEAR operator
    UPDATE @tt_searchwords
    SET operator = 'searchword_near'
    WHERE rowid IN (@o, @oo);
    END
    
    FETCH NEXT FROM near_string INTO @i;
    END
    
    CLOSE near_string;
    DEALLOCATE near_string;
    
    -- delete search words that are now included IN "NEAR phrases"
    DELETE @tt_searchwords
    WHERE operator = 'searchword_near';
    END

    It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

    Builderer of awesome

    This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together.  It’s not pretty, but it works.

    -------------------------------------------------------------------------------------------------
    -- Final command string builderer of awesome
    -------------------------------------------------------------------------------------------------
    -- Need to quote any search terms that contain certain non-alphanumeric characters
    UPDATE @tt_searchwords
    SET word = CASE
    WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
    WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
    WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
    ELSE word END
    WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted
    -- reinitialise
    SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';
    
    -- get the range of row IDs that represent valid search terms and operators
    SELECT @i = MIN(rowid),
    @ii = MAX(rowid)
    FROM @tt_searchwords
    WHERE operator LIKE 'search%'
    OR operator LIKE 'paren%';
    
    WHILE @i <= @ii
    BEGIN
    SELECT @final_search_temp =
    CASE REPLACE(operator, '`', '')
    WHEN 'op_not' THEN ' AND NOT '
    WHEN 'op_or' THEN ' OR '
    WHEN 'paren_o' THEN ' AND ('
    WHEN 'paren_x' THEN ')'
    ELSE ' AND '
    END
    + CASE REPLACE(operator, '`', '')
    WHEN 'op_not' THEN '' -- NOT
    WHEN 'op_or' THEN '' -- OR
    WHEN 'paren_o' THEN '' -- (
    WHEN 'paren_x' THEN '' -- )
    WHEN 'searchnear' THEN word -- NEAR
    WHEN 'searchphrase' THEN word -- "search phrase"
    WHEN 'searchexplicit' THEN word -- explicit search
    WHEN 'searchprefix' THEN word -- wildcard
    ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word)-- AND (default)
    END
    FROM @tt_searchwords
    WHERE rowid = @i;
    
    -- CONTAINSTABLE only accepts up to 4000 characters
    IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
    BREAK
    ELSE
    SET @final_search = @final_search + @final_search_temp;
    
    IF @i = @ii BREAK;
    
    SELECT @i = MIN(rowid)
    FROM @tt_searchwords
    WHERE rowid > @i;
    END
    
    -- sort out issues of operator double-ups, etc
    -- clunky but it works
    SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
    SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
    SET @final_search = '(' + @final_search + ')';
    SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
    SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
    SET @final_search = REPLACE(@final_search, '( AND ', '(');
    SET @final_search = REPLACE(@final_search, ' AND )', ')');
    SET @final_search = REPLACE(@final_search, '( OR ', '(');
    SET @final_search = REPLACE(@final_search, ' OR )', ')');
    SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);

    That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

    Cheers
    DB Dave

  • Full-Text Search Operators – Part 3: Splitting the search string

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators

    We’ve now got to the point where we have a “clean” search string.  The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.

    There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons).  Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.

    Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTE’s to further refine the data.

    -------------------------------------------------------------------------------------------------
    -- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS
    -------------------------------------------------------------------------------------------------
    ;WITH
    banana_split AS
    (
    SELECT id AS rowid,
    LTRIM(RTRIM(data)) as word
    FROM [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators
    WHERE LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-)
    ),

    This then feeds into a 2nd CTE which classifies each individual search term:

    words_classified AS
    (
    -- ********************************************************************************************
    -- The order of the CASE clauses below is important in deciding the order of preference
    -- given to the different special characters. For example, if a string contains both an
    -- asterisk and double quotes (e.g. "search*term") then the operator that appears first in
    -- the CASE statement below will take affect (e.g. either "search term" or search*).
    -- ********************************************************************************************
    SELECT rowid,
    word,
    CASE
    WHEN word = '(' THEN 'paren_o' -- Opening parenthesis
    WHEN word = ')' THEN 'paren_x' -- Closing parenthesis
    WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote
    WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote
    WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah")
    WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah')
    WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah)
    -- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators.
    WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator
    WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&)
    WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|)
    WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~)
    WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-)
    WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*)
    ELSE 'searchword' END AS operator
    FROM banana_split
    ),

    I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or you database already uses a case-sensitive collation).

    Now another CTE tidies up by removing some superfluous characters, etc.:

    -------------------------------------------------------------------------------------------------
    -- NOW TIDY UP THE WORDS AND OPERATORS
    -------------------------------------------------------------------------------------------------
    tidy_up_round1 AS
    (
    -- tidy up some of the words
    SELECT rowid,
    REPLACE(
    CASE
    WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '') -- remove +'s FROM explicit searches
    WHEN operator = 'search_dq_o' THEN RIGHT(word, LEN(word)-1) -- remove the opening quote
    WHEN operator = 'search_dq_x' THEN LEFT(word, LEN(word)-1) -- remove the closing quote
    ELSE word END,
    '"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid)
    operator
    FROM words_classified
    WHERE operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway
    AND (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless its a parenthesis)
    ),
    tidy_up_round2 AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
    word,
    operator
    FROM tidy_up_round1
    WHERE rowid BETWEEN -- exclude any operators that are incorrectly located at the beginnning or end of the string.
    (SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
    AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
    ),

    Next, remove consecutive & duplicate operators…

    -------------------------------------------------------------------------------------------------
    -- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID).
    -------------------------------------------------------------------------------------------------
    de_duped AS
    (
    SELECT a.rowid,
    a.word,
    a.operator
    FROM tidy_up_round2 a
    LEFT JOIN tidy_up_round2 b
    ON a.rowid = (b.rowid + 1) -- consecutive rowid's
    AND a.operator LIKE 'op%' -- only operators
    AND b.operator LIKE 'op%' -- only operators
    WHERE b.rowid IS NULL
    ),

    …and exclude any empty or unnecessary parentheses…

    -------------------------------------------------------------------------------------------------
    -- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)")
    -------------------------------------------------------------------------------------------------
    parentheses_groups AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS prowid,
    rowid,
    word,
    operator
    FROM de_duped
    WHERE operator IN ('paren_o', 'paren_x')
    OR operator LIKE 'search%'
    ),
    empty_pairs AS
    (
    SELECT a.rowid AS rowid_o,
    b.rowid AS rowid_x
    FROM parentheses_groups a
    JOIN parentheses_groups b
    ON a.operator = 'paren_o' -- opening parenthesis
    AND b.operator = 'paren_x' -- closing parenthesis
    AND (
    a.prowid = (b.prowid + 1) -- where opening and closing are consecutive rows (i.e. empty)
    OR a.prowid = (b.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary)
    )
    ),
    -------------------------------------------------------------------------------------------------
    -- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET
    -------------------------------------------------------------------------------------------------
    no_empties AS
    (
    SELECT a.rowid,
    a.word,
    a.operator
    FROM de_duped a
    LEFT JOIN empty_pairs x
    ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x)
    WHERE x.rowid_o IS NULL
    ),

    And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:

    -------------------------------------------------------------------------------------------------
    -- WILDCARD HANDLER
    -------------------------------------------------------------------------------------------------
    wildcard_handler AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
    CASE operator
    WHEN 'searchprefix' THEN
    '"' + CASE
    WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word)-1)
    ELSE REPLACE(word, '*', '')
    END + '*"'
    ELSE word
    END AS word,
    operator
    FROM no_empties
    )
    -------------------------------------------------------------------------------------------------
    -- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE
    -------------------------------------------------------------------------------------------------
    INSERT @tt_searchwords (rowid, word, operator)
    SELECT rowid, word, operator
    FROM wildcard_handler;

    Phew. Enough for this post I think.

    So, at this point if I pass in the following search string – sql NEAR server -(oracle | mysql)then my table variable should be populated with the following:

    Table variable contents

    In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.

    See ya!

    DB Dave

  • Full-Text Search Operators – Part 2: Replacing alternate operators

    Photo by natashalcdIn this post I’m going to deal with another piece of the puzzle that I described in part 1. Now, after re-reading that post it became obvious that I probably need to outline what the end result is meant to look like:

    We want to be able to pass in a search string, and get back a nicely formatted full-text search clause which makes full use of Boolean operators, respects parentheses, and supports wildcards. Easy, right? Yup, and we need to do it all using only T/SQL.

    Here’s the steps that make up the final function (in a nutshell):

    1. Remove some dodgy characters (vertical-tab, form-feed, etc)
    2. Tidy up any parentheses [part 1]
    3. Replace any alternate operators with place-holders (most operators have 2 representations; e.g. OR and |, AND and &, NOT and , etc).
    4. Break the search term into individual words (i.e. split the string on space characters)
    5. Using this list of individual words, do the following:
      1. Classify each word as an operator, search term, parenthesis, wildcard, etc.
      2. Tidy up the list, removing illegal and unnecessary characters
      3. Remove invalid operators (e.g. “ipod NOT” is not a valid search term)
      4. Remove unnecessary parentheses (e.g. single words wrapped in parentheses)
      5. Handle wildcards. Since full-text only accepts prefix searches, change all invalid wildcard searches into prefix searches.
    6. Check for single and double-quotes (& tidy up any invalid quote-marks, make sure only valid pairs remain, etc.).
    7. Process any NEAR(~) operators.
    8. Put the final Full-Text command string together.

    Right, so we’ve already taken care of number 2 in the previous post, so now on to number 3 – replacing the alternate operators.  This step may not be necessary for you, since it’s sole purpose is to preserve the original search string for our reporting purposes.

    You see, if I search for “ipad OR iphone NOT ipod” and you search for “ipad | iphone –ipod”, then obviously since they’re really the same search we will get the same results (and our cache should also cache them as 1 search), but for reporting purposes we want to know that our searches were typed in differently.  The way we do that in our implementation of this function, is that we return a normalised search string used to perform & cache the search, and an original string used for reporting.

    Here’s the very straight forward code:

    IF OBJECT_ID('[dbo].[fn_search_clause_operator_replacer]') IS NULL
    EXEC('CREATE FUNCTION [dbo].[fn_search_clause_operator_replacer] () RETURNS INT AS BEGIN RETURN(0); END ');
    GO
    /***************************************************************************************************************************
    This is a supporting function, which is called by the function [fn_search_clause_get].
    It accepts a string input, and outputs the same string, but with any character operators replaced with
    placeholder text-operators instead.
    Author : David Curlewis
    Date : 07/2009
    
    This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
    http://creativecommons.org/licenses/by-nc/3.0/
    ****************************************************************************************************************************/
    ALTER FUNCTION [dbo].[fn_search_clause_operator_replacer] (
    @search_text NVARCHAR(4000)
    ) RETURNS NVARCHAR(4000)
    AS
    BEGIN
    DECLARE @not_pattern VARCHAR(50),
    @not_pos INT;
    
    SET @search_text = REPLACE(REPLACE(REPLACE(@search_text,
    '|', ' `OR` '), -- OR
    '~', ' `NEAR` '), -- NEAR
    '&', ' `AND` '); -- AND
    
    -- the NOT operator requires some extra work because we only consider it an operator if it is not hyphenating a word
    SET @not_pattern = '%[^a-z0-9][-][ a-z0-9"''(]%'; -- valid NOT operator pattern
    SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get the positions of any valid "-" operators
    WHILE (@not_pos > 0)
    BEGIN
    SET @search_text = STUFF(@search_text, @not_pos, 2, ' `NOT` ');
    SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get pos of next operators (if any)
    END
    
    RETURN(@search_text);
    END
    GO

    The first bit is very simple; it’s just a REPLACE statement replacing each operator’s “symbol operator” (i.e. |, ~, &) with a bit of placeholder text (i.e. `OR`, `NEAR`, `AND`).

    The NOT operators require a little more work because we don’t want to interpret all hyphens as operators, otherwise you wouldn’t be able to search for hyphenated words. Using PATINDEX we can very easily match on a simple RegEx-like pattern: [^a-z0-9][-][ a-z0-9″”(]

    This looks for any occurrences of the hyphen, where it follows a non-alphanumeric character, and precedes a space, an alphanumeric character, a double or single quote, or an opening parenthesis. It then replaces each occurrence with the placeholder text `NOT`.

    In the next post I’ll go through the meatiest bit, which involves splitting the string (number 4), and preparing the resulting data (number 5).

    Cheers

    DB Dave

  • Full-Text Search Operators – Part 1: Cleaning up parentheses in a string

    BooleanBeards

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string
    Part 4: Wrapping it up

    A few years ago I wanted to have a crack at improving one of our websites’ search by providing “search operator” functionality.  These are basically special words or characters (i.e. operators) which provide Boolean search capability; e.g. AND, OR, NOT.

    I looked around but couldn’t find anything implemented using T/SQL.  There were a few .Net solutions, but I’m not a developer, and can only handle very basic .Net on a good day.  So I had a crack at writing my own. I started off with the Google cheat-sheet as my initial list of requirements, which was basically as follows:

    1. Must support the operators AND, OR, NOT, and NEAR
    2. Must allow for “exact-phrase” searches
    3. Should respect parentheses (for precedence control)
    4. Should be able to perform wildcard searches
    5. Operators must be case-sensitive (i.e. OR != or)

    To attack this problem, I decided to split it into multiple chunks, and write separate blocks of code to meet each requirements.  In this post I’m going to introduce the first piece of the puzzle (which is number 3 from the list above; handling parentheses).  What this means is that given a string containing parentheses, I need to output the same string with valid parentheses intact (and obviously invalid ones removed).

    Below is the function. I won’t explain it line by line since its pretty well commented, and relatively simple anyway.

    IF OBJECT_ID('[dbo].[fn_search_clause_parentheses_handler]') IS NULL
    EXEC('CREATE FUNCTION [dbo].[fn_search_clause_parentheses_handler] () RETURNS INT AS BEGIN RETURN(0); END ');
    GO
    /***************************************************************************************************************************
    This is a supporting function, which is called by the function [fn_search_clause_get].
    It accepts a string input, and outputs the same string, but with any invalid parentheses removed, and all remaining
    valid parentheses buffered (by a space on each side) so that they are split correctly by the calling function.
    Author : David Curlewis
    Date : 07/2009
    
    This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
    http://creativecommons.org/licenses/by-nc/3.0/
    ****************************************************************************************************************************/
    ALTER FUNCTION [dbo].[fn_search_clause_parentheses_handler] (
    @search_text NVARCHAR(4000)
    ) RETURNS NVARCHAR(4000)
    AS
    BEGIN
    -------------------------------------------------------------------------------------------------
    -- PARENTHESES HANDLER
    -------------------------------------------------------------------------------------------------
    -- IF parentheses exist in the string
    IF (PATINDEX('%[()]%', @search_text) > 0)
    BEGIN
    DECLARE @pos_o INT, @pos_x INT, @charpos INT;
    DECLARE @tt_char_split TABLE (id SMALLINT IDENTITY(1,1), [char] VARCHAR(2) NOT NULL);
    SET @charpos = 1;
    
    -- split the string apart into a temp table
    WHILE @charpos <= LEN(@search_text)
    BEGIN
    INSERT @tt_char_split ([char])
    SELECT SUBSTRING(@search_text, @charpos, 1);
    
    SET @charpos = @charpos + 1;
    END
    
    -- while we have opening and closing parentheses
    WHILE EXISTS(SELECT TOP 1 1 FROM @tt_char_split WHERE [char] = '(')
    AND EXISTS(SELECT TOP 1 1 FROM @tt_char_split WHERE [char] = ')')
    BEGIN
    -- Get the position of the first closing parenthesis
    SET @pos_x = ( SELECT MIN(id)
    FROM @tt_char_split
    WHERE [char] = ')');
    
    -- Get the position of the first opening parenthesis
    SET @pos_o = ( SELECT MAX(id)
    FROM @tt_char_split
    WHERE [char] = '('
    AND id < @pos_x);
    
    -- there is a valid pair of parentheses
    IF (@pos_o IS NOT NULL AND @pos_x IS NOT NULL)
    BEGIN
    -- Escape this pair so we know they've been processed
    UPDATE @tt_char_split
    SET [char] = '(('
    WHERE id = @pos_o;
    
    UPDATE @tt_char_split
    SET [char] = '))'
    WHERE id = @pos_x;
    END
    ELSE BEGIN -- there is not a valid pair of parentheses
    UPDATE @tt_char_split
    SET [char] = ''
    WHERE id IN (@pos_o, @pos_x);
    END
    END
    
    -- remove any remaining (invalid) parentheses
    UPDATE @tt_char_split
    SET [char] = ''
    WHERE [char] IN ('(', ')');
    
    SET @search_text = '';
    
    -- build new search string
    SELECT @search_text = @search_text + REPLACE(REPLACE([char], '((', ' ( '), '))', ' ) ')
    FROM @tt_char_split
    ORDER BY id;
    
    -- remove empty pairs, i.e. " ( ) "
    WHILE CHARINDEX(' ( ) ', @search_text) > 0
    SET @search_text = REPLACE(@search_text, ' ( ) ', '');
    END
    
    RETURN(@search_text);
    END
    GO

    So, its a scalar function which accepts a string, and returns that same string, but with it’s parentheses tidied up and “buffered” (i.e. I wrap them in spaces to make it easier for the calling function to then split the string into its component parts).

    Now, as I sit here watching the latest straight-to-dvd masterpiece that is “Barbie: A fashion fairy-tale(hey, its a Sunday afternoon and I have 3 daughters – you’d lose the battle too!), I’m having a hard time getting the code to post properly, so if you have any issues, flick me a note in the comments below and I’ll sort something else out.

    Give it a try and let me know if you find any bugs (or make any improvements!). I’ll follow up soon with posts describing the rest of the process, but for now I’ve had enough of Barbie, and think I deserve a whisky. Winking smile

    Cheers

    DB Dave