Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators
Part 3: Splitting the search string
Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two. I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!
Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T/SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.
NEAR enough…
“NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINS or CONTAINSTABLE.
In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them. Where NEAR is different to AND, is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.
------------------------------------------------------------------------------------------------- -- "NEAR" OPERATOR HANDLER ------------------------------------------------------------------------------------------------- IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near') BEGIN -- reinitialise SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL; -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...) DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR SELECT rowid FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near'; OPEN near_string; FETCH NEXT FROM near_string INTO @i; WHILE @@FETCH_STATUS = 0 BEGIN -- get the rowid of the value before and after the NEAR operator row SELECT @o = MAX(rowid) FROM @tt_searchwords WHERE rowid < @i AND operator NOT LIKE 'paren%'; SELECT @oo = MIN(rowid) FROM @tt_searchwords WHERE rowid > @i AND operator NOT LIKE 'paren%'; -- if the NEAR-operator (~) is the first or last word in the search string, its not a valid option IF (@o IS NULL) OR (@oo IS NULL) DELETE @tt_searchwords WHERE rowid = @i; ELSE BEGIN -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END FROM @tt_searchwords WHERE rowid = @o; SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')' FROM @tt_searchwords WHERE rowid = @oo; -- update the NEAR operator record with the concatenated search term UPDATE @tt_searchwords SET word = @current_word, operator = 'searchnear' WHERE rowid = @i; -- delete the records before and after the NEAR operator UPDATE @tt_searchwords SET operator = 'searchword_near' WHERE rowid IN (@o, @oo); END FETCH NEXT FROM near_string INTO @i; END CLOSE near_string; DEALLOCATE near_string; -- delete search words that are now included IN "NEAR phrases" DELETE @tt_searchwords WHERE operator = 'searchword_near'; END
It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!
Builderer of awesome
This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together. It’s not pretty, but it works.
------------------------------------------------------------------------------------------------- -- Final command string builderer of awesome ------------------------------------------------------------------------------------------------- -- Need to quote any search terms that contain certain non-alphanumeric characters UPDATE @tt_searchwords SET word = CASE WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"' WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"' WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"' ELSE word END WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted -- reinitialise SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = ''; -- get the range of row IDs that represent valid search terms and operators SELECT @i = MIN(rowid), @ii = MAX(rowid) FROM @tt_searchwords WHERE operator LIKE 'search%' OR operator LIKE 'paren%'; WHILE @i <= @ii BEGIN SELECT @final_search_temp = CASE REPLACE(operator, '`', '') WHEN 'op_not' THEN ' AND NOT ' WHEN 'op_or' THEN ' OR ' WHEN 'paren_o' THEN ' AND (' WHEN 'paren_x' THEN ')' ELSE ' AND ' END + CASE REPLACE(operator, '`', '') WHEN 'op_not' THEN '' -- NOT WHEN 'op_or' THEN '' -- OR WHEN 'paren_o' THEN '' -- ( WHEN 'paren_x' THEN '' -- ) WHEN 'searchnear' THEN word -- NEAR WHEN 'searchphrase' THEN word -- "search phrase" WHEN 'searchexplicit' THEN word -- explicit search WHEN 'searchprefix' THEN word -- wildcard ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word)-- AND (default) END FROM @tt_searchwords WHERE rowid = @i; -- CONTAINSTABLE only accepts up to 4000 characters IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000 BREAK ELSE SET @final_search = @final_search + @final_search_temp; IF @i = @ii BREAK; SELECT @i = MIN(rowid) FROM @tt_searchwords WHERE rowid > @i; END -- sort out issues of operator double-ups, etc -- clunky but it works SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT '); SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR '); SET @final_search = '(' + @final_search + ')'; SET @final_search = REPLACE(@final_search, '( AND NOT ', '('); SET @final_search = REPLACE(@final_search, ' AND NOT )', ')'); SET @final_search = REPLACE(@final_search, '( AND ', '('); SET @final_search = REPLACE(@final_search, ' AND )', ')'); SET @final_search = REPLACE(@final_search, '( OR ', '('); SET @final_search = REPLACE(@final_search, ' OR )', ')'); SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);
That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.
Cheers
DB Dave
Just a note to say excellent work DBDave, and I never really do this but you have saved me a ton of work with this.
Given me lots of new ideas to play woth as well.
Again Superb
Thanks Peter – glad you got some value from it. Let me know if you end up making any improvements, etc. 😉
Pingback: Full-Text Search Operators – Part 1: Cleaning up parentheses in a string - database dave
Pingback: SQL Server word-level bigram function - database dave