Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators
We’ve now got to the point where we have a “clean” search string. The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.
There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons). Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.
Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTE’s to further refine the data.
------------------------------------------------------------------------------------------------- -- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS ------------------------------------------------------------------------------------------------- ;WITH banana_split AS ( SELECT id AS rowid, LTRIM(RTRIM(data)) as word FROM [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators WHERE LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-) ),
This then feeds into a 2nd CTE which classifies each individual search term:
words_classified AS ( -- ******************************************************************************************** -- The order of the CASE clauses below is important in deciding the order of preference -- given to the different special characters. For example, if a string contains both an -- asterisk and double quotes (e.g. "search*term") then the operator that appears first in -- the CASE statement below will take affect (e.g. either "search term" or search*). -- ******************************************************************************************** SELECT rowid, word, CASE WHEN word = '(' THEN 'paren_o' -- Opening parenthesis WHEN word = ')' THEN 'paren_x' -- Closing parenthesis WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah") WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah') WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah) -- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators. WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&) WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|) WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~) WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-) WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*) ELSE 'searchword' END AS operator FROM banana_split ),
I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or you database already uses a case-sensitive collation).
Now another CTE tidies up by removing some superfluous characters, etc.:
------------------------------------------------------------------------------------------------- -- NOW TIDY UP THE WORDS AND OPERATORS ------------------------------------------------------------------------------------------------- tidy_up_round1 AS ( -- tidy up some of the words SELECT rowid, REPLACE( CASE WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '') -- remove +'s FROM explicit searches WHEN operator = 'search_dq_o' THEN RIGHT(word, LEN(word)-1) -- remove the opening quote WHEN operator = 'search_dq_x' THEN LEFT(word, LEN(word)-1) -- remove the closing quote ELSE word END, '"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid) operator FROM words_classified WHERE operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway AND (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless its a parenthesis) ), tidy_up_round2 AS ( SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid, word, operator FROM tidy_up_round1 WHERE rowid BETWEEN -- exclude any operators that are incorrectly located at the beginnning or end of the string. (SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or')) AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or')) ),
Next, remove consecutive & duplicate operators…
------------------------------------------------------------------------------------------------- -- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID). ------------------------------------------------------------------------------------------------- de_duped AS ( SELECT a.rowid, a.word, a.operator FROM tidy_up_round2 a LEFT JOIN tidy_up_round2 b ON a.rowid = (b.rowid + 1) -- consecutive rowid's AND a.operator LIKE 'op%' -- only operators AND b.operator LIKE 'op%' -- only operators WHERE b.rowid IS NULL ),
…and exclude any empty or unnecessary parentheses…
------------------------------------------------------------------------------------------------- -- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)") ------------------------------------------------------------------------------------------------- parentheses_groups AS ( SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS prowid, rowid, word, operator FROM de_duped WHERE operator IN ('paren_o', 'paren_x') OR operator LIKE 'search%' ), empty_pairs AS ( SELECT a.rowid AS rowid_o, b.rowid AS rowid_x FROM parentheses_groups a JOIN parentheses_groups b ON a.operator = 'paren_o' -- opening parenthesis AND b.operator = 'paren_x' -- closing parenthesis AND ( a.prowid = (b.prowid + 1) -- where opening and closing are consecutive rows (i.e. empty) OR a.prowid = (b.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary) ) ), ------------------------------------------------------------------------------------------------- -- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET ------------------------------------------------------------------------------------------------- no_empties AS ( SELECT a.rowid, a.word, a.operator FROM de_duped a LEFT JOIN empty_pairs x ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x) WHERE x.rowid_o IS NULL ),
And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:
------------------------------------------------------------------------------------------------- -- WILDCARD HANDLER ------------------------------------------------------------------------------------------------- wildcard_handler AS ( SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid, CASE operator WHEN 'searchprefix' THEN '"' + CASE WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word)-1) ELSE REPLACE(word, '*', '') END + '*"' ELSE word END AS word, operator FROM no_empties ) ------------------------------------------------------------------------------------------------- -- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE ------------------------------------------------------------------------------------------------- INSERT @tt_searchwords (rowid, word, operator) SELECT rowid, word, operator FROM wildcard_handler;
Phew. Enough for this post I think.
So, at this point if I pass in the following search string – sql NEAR server -(oracle | mysql) – then my table variable should be populated with the following:
In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.
See ya!
DB Dave
Pingback: Full-Text Search Operators – Part 1: Cleaning up parentheses in a string - database dave
Pingback: Full-Text Search Operators – Part 4: Wrapping it up - database dave