I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full-text searches that people perform). Over the years I’ve been involved in various aspects of our search, from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators, for example) to working on our recommendation algorithm and implementing search auto-suggest.
Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.
Did you mean…?
The way we currently do it is based on taking the individual words entered, generating a large number of alternative spellings for each of them, and then comparing those candidate spellings against a “known-good” list of search terms, returning the most likely match for each word entered (according to a magic score based on frequency, average results per search, etc.). This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word-level n-grams (unigrams, bigrams, and trigrams specifically).
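To make that a bit more concrete, here’s a rough sketch of the kind of lookup the current approach boils down to. The object names (#candidate_spellings, known_good_terms) and the scoring columns are made up purely for illustration – they’re not our actual schema or scoring formula.

-- Hypothetical sketch only: pick the best "known-good" match for one entered word,
-- given a pre-generated list of alternative spellings for it.
SELECT TOP (1)
        kg.term
FROM    #candidate_spellings AS c               -- alternative spellings generated for the word
JOIN    dbo.known_good_terms AS kg              -- the "known-good" search terms
        ON kg.term = c.candidate
ORDER BY kg.search_frequency * kg.avg_results_per_search DESC;  -- stand-in for the "magic score"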
Basically, the n-gram approach involves taking each search string and splitting it into individual words to give you your unigrams, then pairing up adjacent words to give you your bigrams.
For example, the string “pink ipod nano” would be returned as:
Word 1 | Word 2
#      | pink
pink   | ipod
ipod   | nano
nano   | #
The hash symbols are delimiters and are sometimes not wanted, in which case the first and last rows of this table wouldn’t be returned. In my function below I use a less common symbol (the Eszett, “ß”) as my delimiter, and have a parameter to allow it to be omitted if necessary.
What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type (in the case of a misspelling).
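As a purely illustrative example, the stored bigrams could live in something like the table below. The name dbo.search_bigrams and its columns are my own invention for this sketch, not an existing object.

-- Hypothetical storage for known bigrams and how often each pair occurs.
-- A unigram table would look much the same, minus one word column.
CREATE TABLE dbo.search_bigrams
(
    word1     NVARCHAR(100) NOT NULL,   -- 'ß' when the pair starts a search string
    word2     NVARCHAR(100) NOT NULL,   -- 'ß' when the pair ends a search string
    frequency INT           NOT NULL,   -- number of searches containing this pair
    CONSTRAINT PK_search_bigrams PRIMARY KEY (word1, word2)
);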
For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters). Or, if the same user types in “ipod nqno”, then we should be able to work out, firstly, that “nqno” isn’t a common word according to our list of unigrams, and secondly that “ipod” followed by a space and “n” most likely means “ipod nano” – and we can then recommend that as an alternative search.
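Using the hypothetical dbo.search_bigrams table from above, that second lookup might look something like this:

-- Hypothetical lookup: the user has typed "ipod n", so find the most common
-- words that follow "ipod" and start with "n" in our known bigrams.
SELECT TOP (5)
        word2,
        frequency
FROM    dbo.search_bigrams
WHERE   word1 = N'ipod'
AND     word2 LIKE N'n%'
ORDER BY frequency DESC;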
Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂
The guts
It’s relatively self-explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL version allows me to test my theory before asking a developer to spend any time on it.
-- =============================================
-- Author:      David Curlewis
-- Create date: 2011-09-08
-- Description: Splits a string and returns word-level bigrams
--              (i.e. all pairs of words making up the string)
-- =============================================
CREATE FUNCTION dbo.fn_word_bigrams
(
    @string             NVARCHAR(1000),
    @include_delimiters BIT = 1  -- whether to include rows for prefix and suffix delimiters
)
RETURNS @bigrams TABLE
(
    row_id INT,
    word1  NVARCHAR(100),
    word2  NVARCHAR(100)
)
AS
BEGIN
    DECLARE @rowcount INT,
            @i        INT;
    DECLARE @tt_words   TABLE (row_id INT, word NVARCHAR(100));
    DECLARE @tt_bigrams TABLE (word1 NVARCHAR(100), word2 NVARCHAR(100));

    SET @i = 1;
    SET @string = ltrim(rtrim(@string))
                  + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;

    WITH split_words AS
    (
        -- dbo.SplitString_Multi is a separate table-valued string-splitting helper;
        -- ORDER BY @@SPID is just a trick to satisfy ROW_NUMBER's ORDER BY requirement
        SELECT  ROW_NUMBER() OVER (ORDER BY @@SPID) AS row_id,
                item AS word
        FROM    dbo.SplitString_Multi(@string, ' ')
    ),
    bigrams AS
    (
        SELECT  row_id,
                CAST('ß' AS NVARCHAR(4000)) AS word1,  -- ß is our delimiter
                CAST(word AS NVARCHAR(4000)) AS word2
        FROM    split_words
        WHERE   row_id = 1
        UNION ALL
        SELECT  b.row_id + 1 AS row_id,
                CAST(b.word2 AS NVARCHAR(4000)) AS word1,
                CAST(w.word AS NVARCHAR(4000)) AS word2
        FROM    split_words AS w
        JOIN    bigrams     AS b ON w.row_id = b.row_id + 1
    )
    INSERT  @bigrams
    SELECT  row_id,
            word1,
            word2
    FROM    bigrams
    WHERE   row_id > CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;

    RETURN;
END
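To give you an idea of the output, calling it against the string from the earlier example returns the same bigrams as the table above, and passing 0 as the second parameter drops the delimiter rows:

-- With delimiter rows included:
SELECT row_id, word1, word2
FROM   dbo.fn_word_bigrams(N'pink ipod nano', 1);

-- row_id  word1  word2
-- ------  -----  -----
-- 1       ß      pink
-- 2       pink   ipod
-- 3       ipod   nano
-- 4       nano   ß

-- Without delimiter rows (just "pink ipod" and "ipod nano"):
SELECT row_id, word1, word2
FROM   dbo.fn_word_bigrams(N'pink ipod nano', 0);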
Shout if you have any questions or comments.
Cheers,
DB Dave