SQL Server word-level bigram function

I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full text searches that people perform).  Over the years I’ve been involved in various aspects of our search; from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators for example) to working on our recommendation algorithm, and implementing search auto-suggest.

Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.

Did you mean…?

The way we currently do it is based on taking the individual words entered, and generating a large number of alternative spellings for each of them, and then comparing those lists of words against a “known-good” list of search terms and returning the most likely one for each word entered (according to a magic score based on frequency, average results per search, etc).  This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word level n-grams (unigrams, bigrams, and trigrams specifically).

Basically, this involves taking each search string and splitting it into individual words to give you your unigrams, or then pairing them up (in the case of bigrams).

For example, the string “pink ipod nano” would be returned as:

Word 1Word 2
#pink
pinkipod
ipodnano
nano#

The hash symbols are delimiters and are sometimes not used, in which case the 1st and last rows of this table wouldn’t be returned. In my function below, I use a less common symbol (the “Eszett”) as my delimiter, and have a parameter to allow it to be omitted if necessary.

What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type in the case of a misspelling.

For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters).  Or, if the same user types in “ipod nqno“, then we should be able to work out that firstly “nqno” isn’t a common word according to our list of unigrams, and secondly that the word “ipod” followed by a space and “n” is most likely meant to be “ipod nano” – and we can then recommend that as an alternative search.

Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂

The guts

It’s relatively self explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL function allows me to test my theory myself before asking a developer to spend any time on it.

-- =============================================
-- Author:		David Curlewis
-- Create date: 2011-09-08
-- Description:	Splits a string and returns word-level bigrams (i.e. all pairs of words making up the string)
-- =============================================
CREATE FUNCTION dbo.fn_word_bigrams (
    @string             NVARCHAR(1000),
    @include_delimiters BIT = 1 -- whether to include rows for prefix and suffix delimiters
)
RETURNS @bigrams TABLE (
    row_id  INT, 
    word1   NVARCHAR(100),
    word2   NVARCHAR(100) 
)
AS
BEGIN
    DECLARE @rowcount   INT,
            @i          INT;

    DECLARE @tt_words   TABLE (row_id INT, word NVARCHAR(100))
    DECLARE @tt_bigrams TABLE (word1 NVARCHAR(100), word2 NVARCHAR(100))

    SET @i = 1;
    SET @string = ltrim(rtrim(@string)) 
        + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;

    WITH split_words AS
    (
        SELECT  ROW_NUMBER() OVER(ORDER BY @@SPID) AS row_id,
                item AS word
        FROM    dbo.SplitString_Multi(@string, ' ')
    )
    , bigrams AS 
    (
        SELECT  row_id,
                CAST('ß' AS NVARCHAR(4000)) AS word1, -- ß is our delimiter
                CAST(word AS NVARCHAR(4000)) AS word2
        FROM    split_words
        WHERE   row_id = 1
        UNION ALL
        SELECT  b.row_id + 1 AS row_id,
                CAST(b.word2 AS NVARCHAR(4000)) AS word1, 
                CAST(w.word AS NVARCHAR(4000)) AS word2
        FROM    split_words AS w
        JOIN    bigrams AS b ON w.row_id = b.row_id + 1
    )
    INSERT  @bigrams
    SELECT  row_id,
            word1,
            word2
    FROM    bigrams
    WHERE   row_id > CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;
    
    RETURN; 
END

Shout if you have any questions or comments.

Cheers,

DB Dave

2 thoughts on “SQL Server word-level bigram function”

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top