In this post I’m going to deal with another piece of the puzzle that I described in part 1. Now, after re-reading that post it became obvious that I probably need to outline what the end result is meant to look like:
We want to be able to pass in a search string, and get back a nicely formatted full-text search clause which makes full use of Boolean operators, respects parentheses, and supports wildcards. Easy, right? Yup, and we need to do it all using only T/SQL.
Here’s the steps that make up the final function (in a nutshell):
- Remove some dodgy characters (vertical-tab, form-feed, etc)
- Tidy up any parentheses [part 1]
- Replace any alternate operators with place-holders (most operators have 2 representations; e.g. OR and |, AND and &, NOT and –, etc).
- Break the search term into individual words (i.e. split the string on space characters)
- Using this list of individual words, do the following:
- Classify each word as an operator, search term, parenthesis, wildcard, etc.
- Tidy up the list, removing illegal and unnecessary characters
- Remove invalid operators (e.g. “ipod NOT” is not a valid search term)
- Remove unnecessary parentheses (e.g. single words wrapped in parentheses)
- Handle wildcards. Since full-text only accepts prefix searches, change all invalid wildcard searches into prefix searches.
- Check for single and double-quotes (& tidy up any invalid quote-marks, make sure only valid pairs remain, etc.).
- Process any NEAR(~) operators.
- Put the final Full-Text command string together.
Right, so we’ve already taken care of number 2 in the previous post, so now on to number 3 – replacing the alternate operators. This step may not be necessary for you, since it’s sole purpose is to preserve the original search string for our reporting purposes.
You see, if I search for “ipad OR iphone NOT ipod” and you search for “ipad | iphone –ipod”, then obviously since they’re really the same search we will get the same results (and our cache should also cache them as 1 search), but for reporting purposes we want to know that our searches were typed in differently. The way we do that in our implementation of this function, is that we return a normalised search string used to perform & cache the search, and an original string used for reporting.
Here’s the very straight forward code:
IF OBJECT_ID('[dbo].[fn_search_clause_operator_replacer]') IS NULL EXEC('CREATE FUNCTION [dbo].[fn_search_clause_operator_replacer] () RETURNS INT AS BEGIN RETURN(0); END '); GO /*************************************************************************************************************************** This is a supporting function, which is called by the function [fn_search_clause_get]. It accepts a string input, and outputs the same string, but with any character operators replaced with placeholder text-operators instead. Author : David Curlewis Date : 07/2009 This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/ ****************************************************************************************************************************/ ALTER FUNCTION [dbo].[fn_search_clause_operator_replacer] ( @search_text NVARCHAR(4000) ) RETURNS NVARCHAR(4000) AS BEGIN DECLARE @not_pattern VARCHAR(50), @not_pos INT; SET @search_text = REPLACE(REPLACE(REPLACE(@search_text, '|', ' `OR` '), -- OR '~', ' `NEAR` '), -- NEAR '&', ' `AND` '); -- AND -- the NOT operator requires some extra work because we only consider it an operator if it is not hyphenating a word SET @not_pattern = '%[^a-z0-9][-][ a-z0-9"''(]%'; -- valid NOT operator pattern SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get the positions of any valid "-" operators WHILE (@not_pos > 0) BEGIN SET @search_text = STUFF(@search_text, @not_pos, 2, ' `NOT` '); SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get pos of next operators (if any) END RETURN(@search_text); END GO
The first bit is very simple; it’s just a REPLACE statement replacing each operator’s “symbol operator” (i.e. |, ~, &) with a bit of placeholder text (i.e. `OR`, `NEAR`, `AND`).
The NOT operators require a little more work because we don’t want to interpret all hyphens as operators, otherwise you wouldn’t be able to search for hyphenated words. Using PATINDEX we can very easily match on a simple RegEx-like pattern: [^a-z0-9][-][ a-z0-9″”(]
This looks for any occurrences of the hyphen, where it follows a non-alphanumeric character, and precedes a space, an alphanumeric character, a double or single quote, or an opening parenthesis. It then replaces each occurrence with the placeholder text `NOT`.
In the next post I’ll go through the meatiest bit, which involves splitting the string (number 4), and preparing the resulting data (number 5).
Cheers
DB Dave
Pingback: Full-Text Search Operators – Part 1: Cleaning up parentheses in a string | David Curlewis
Pingback: Full-Text Search Operators – Part 4: Wrapping it up | David Curlewis
Pingback: Full-Text Search Operators – Part 3: Splitting the search string - database dave