Full-Text Search Operators – Part 3: Splitting the search string

Part 1: Cleaning up parentheses in a string
Part 2: Replacing alternate operators

We’ve now got to the point where we have a “clean” search string.  The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.

There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons).  Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.

Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTE’s to further refine the data.

-------------------------------------------------------------------------------------------------
-- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS
-------------------------------------------------------------------------------------------------
;WITH
banana_split AS
(
SELECT id AS rowid,
LTRIM(RTRIM(data)) as word
FROM [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators
WHERE LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-)
),

This then feeds into a 2nd CTE which classifies each individual search term:

words_classified AS
(
-- ********************************************************************************************
-- The order of the CASE clauses below is important in deciding the order of preference
-- given to the different special characters. For example, if a string contains both an
-- asterisk and double quotes (e.g. "search*term") then the operator that appears first in
-- the CASE statement below will take affect (e.g. either "search term" or search*).
-- ********************************************************************************************
SELECT rowid,
word,
CASE
WHEN word = '(' THEN 'paren_o' -- Opening parenthesis
WHEN word = ')' THEN 'paren_x' -- Closing parenthesis
WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote
WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote
WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah")
WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah')
WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah)
-- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators.
WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator
WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator
WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator
WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator
WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&)
WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|)
WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~)
WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-)
WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*)
ELSE 'searchword' END AS operator
FROM banana_split
),

I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or you database already uses a case-sensitive collation).

Now another CTE tidies up by removing some superfluous characters, etc.:

-------------------------------------------------------------------------------------------------
-- NOW TIDY UP THE WORDS AND OPERATORS
-------------------------------------------------------------------------------------------------
tidy_up_round1 AS
(
-- tidy up some of the words
SELECT rowid,
REPLACE(
CASE
WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '') -- remove +'s FROM explicit searches
WHEN operator = 'search_dq_o' THEN RIGHT(word, LEN(word)-1) -- remove the opening quote
WHEN operator = 'search_dq_x' THEN LEFT(word, LEN(word)-1) -- remove the closing quote
ELSE word END,
'"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid)
operator
FROM words_classified
WHERE operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway
AND (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless its a parenthesis)
),
tidy_up_round2 AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
word,
operator
FROM tidy_up_round1
WHERE rowid BETWEEN -- exclude any operators that are incorrectly located at the beginnning or end of the string.
(SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
),

Next, remove consecutive & duplicate operators…

-------------------------------------------------------------------------------------------------
-- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID).
-------------------------------------------------------------------------------------------------
de_duped AS
(
SELECT a.rowid,
a.word,
a.operator
FROM tidy_up_round2 a
LEFT JOIN tidy_up_round2 b
ON a.rowid = (b.rowid + 1) -- consecutive rowid's
AND a.operator LIKE 'op%' -- only operators
AND b.operator LIKE 'op%' -- only operators
WHERE b.rowid IS NULL
),

…and exclude any empty or unnecessary parentheses…

-------------------------------------------------------------------------------------------------
-- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)")
-------------------------------------------------------------------------------------------------
parentheses_groups AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS prowid,
rowid,
word,
operator
FROM de_duped
WHERE operator IN ('paren_o', 'paren_x')
OR operator LIKE 'search%'
),
empty_pairs AS
(
SELECT a.rowid AS rowid_o,
b.rowid AS rowid_x
FROM parentheses_groups a
JOIN parentheses_groups b
ON a.operator = 'paren_o' -- opening parenthesis
AND b.operator = 'paren_x' -- closing parenthesis
AND (
a.prowid = (b.prowid + 1) -- where opening and closing are consecutive rows (i.e. empty)
OR a.prowid = (b.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary)
)
),
-------------------------------------------------------------------------------------------------
-- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET
-------------------------------------------------------------------------------------------------
no_empties AS
(
SELECT a.rowid,
a.word,
a.operator
FROM de_duped a
LEFT JOIN empty_pairs x
ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x)
WHERE x.rowid_o IS NULL
),

And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:

-------------------------------------------------------------------------------------------------
-- WILDCARD HANDLER
-------------------------------------------------------------------------------------------------
wildcard_handler AS
(
SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
CASE operator
WHEN 'searchprefix' THEN
'"' + CASE
WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word)-1)
ELSE REPLACE(word, '*', '')
END + '*"'
ELSE word
END AS word,
operator
FROM no_empties
)
-------------------------------------------------------------------------------------------------
-- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE
-------------------------------------------------------------------------------------------------
INSERT @tt_searchwords (rowid, word, operator)
SELECT rowid, word, operator
FROM wildcard_handler;

Phew. Enough for this post I think.

So, at this point if I pass in the following search string – sql NEAR server -(oracle | mysql)then my table variable should be populated with the following:

Table variable contents

In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.

See ya!

DB Dave

2 thoughts on “Full-Text Search Operators – Part 3: Splitting the search string”

  1. Pingback: Full-Text Search Operators – Part 1: Cleaning up parentheses in a string - database dave

  2. Pingback: Full-Text Search Operators – Part 4: Wrapping it up - database dave

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top