Category: SQL Server

  • Multiple CTE gotcha

    This is probably one of those things you’ll read and think “What a dumbass, I’d never make that mistake!” – which is fine, but I thought I’d post about it nevertheless since it managed to trip me up twice in the space of a week.

    I’ve got a script which grabs data from several large tables (tens of millions of rows each), and aggregates and sanitises the data in 5 or 6 distinct steps.  I used to do this using several temp tables in the “bad old days”, but now use easier to read (to me anyway) Common Table Expressions – or CTEs.

    This post isn’t going to tell you how to use CTEs – there’s a pile of great stuff out there waiting for you to dig it up using your search engine of choice.

    This post will, however, demonstrate how three CTEs managed to first trick me into thinking I was a performance tuning god, and then confound me with an “impossible to fix” bug.

    Consider the following pseudo code:

    ;with cte_1 as
    (
        select  blah,
                COUNT(*),
                AVG(value)
        from    a_very_big_table
        group by blah
        having  AVG(value) >= 10
    ),
    cte_2 as
    (
        select  some_scalar_function(blah),
                some_value,
                a_value
        from    cte_1
        where   some_clause
    ),
    cte_3 as
    (
        select  some_values...
        from    cte_1
        where   some_criteria
        group by some_more_grouping
    )
    insert somewhere
    select some_stuff
    from   cte_3

    I wrote something similar to this to grab some data from a large table, whittle it down with some WHERE clauses in the 1st CTE, then run some gnarly functions and more WHERE clauses in the 2nd CTE, and finally some more aggregations in a 3rd CTE before finally inserting the results into a staging table. So each CTE should reference the CTE above it.

    "Face Palm" by tarabrown
    “Doh!”

    As you can see above though, CTE_3 is referencing CTE_1 instead of CTE_2.  This basically means that SQL ignores CTE_2 (since its output isn’t used anywhere).  This explains why I thought I was a performance tuning god – I made a few changes to the query, and the run-time went from 6 minutes to 12 seconds! I also soon found that the data didn’t look right, and it took 45 minutes of pointless poking around before I had that forehead-smacking realisation that a typo was the cause of my pain (and my runtime was back up to 6 minutes, awww man!).
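
    For the record, the fix was a one-word change – pointing the last CTE at the right source (again in pseudo code):

    cte_3 as
    (
        select  some_values...
        from    cte_2   -- was cte_1; now the output of cte_2 is actually used
        where   some_criteria
        group by some_more_grouping
    )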

    Yes, this could be (and was) done in a single statement, but I broke it up into separate CTEs for readability.

    Cheers
    DB Dave

  • Full-Text Search Operators – Part 4: Wrapping it up

    "iPod nano 4Gb wrapping" by Darkmere
    That’s a wrap!

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string

    Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two.  I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

    Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T-SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

    NEAR enough…

    “NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINSTABLE (the CONTAINS predicate doesn’t expose a rank).

    In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them.  Where NEAR differs from AND is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (the closer the terms are to each other, the higher the score; 0 if they are more than 50 characters apart).
    So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.
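
    Here’s roughly what that looks like in practice (the table and column names below are made up for the example):

    SELECT   d.id, d.title, ft.[RANK]
    FROM     CONTAINSTABLE(dbo.MyDocuments, title, '(dog NEAR cat)') AS ft
    JOIN     dbo.MyDocuments AS d
             ON d.id = ft.[KEY]
    WHERE    ft.[RANK] > 0   -- without this filter, NEAR behaves just like AND
    ORDER BY ft.[RANK] DESC;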

    -------------------------------------------------------------------------------------------------
    -- "NEAR" OPERATOR HANDLER
    -------------------------------------------------------------------------------------------------
    IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
    BEGIN
        -- reinitialise
        SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;

        -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
        DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
            SELECT  rowid
            FROM    @tt_searchwords
            WHERE   REPLACE(operator, '`', '') = 'op_near';

        OPEN near_string;
        FETCH NEXT FROM near_string INTO @i;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- get the rowid of the value before and after the NEAR operator row
            SELECT  @o = MAX(rowid)
            FROM    @tt_searchwords
            WHERE   rowid < @i
            AND     operator NOT LIKE 'paren%';

            SELECT  @oo = MIN(rowid)
            FROM    @tt_searchwords
            WHERE   rowid > @i
            AND     operator NOT LIKE 'paren%';

            -- if the NEAR operator (~) is the first or last word in the search string, it's not a valid option
            IF (@o IS NULL) OR (@oo IS NULL)
                DELETE  @tt_searchwords
                WHERE   rowid = @i;
            ELSE BEGIN
                -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
                SELECT  @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
                FROM    @tt_searchwords
                WHERE   rowid = @o;

                SELECT  @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
                FROM    @tt_searchwords
                WHERE   rowid = @oo;

                -- update the NEAR operator record with the concatenated search term
                UPDATE  @tt_searchwords
                SET     word = @current_word,
                        operator = 'searchnear'
                WHERE   rowid = @i;

                -- flag the records before and after the NEAR operator (they get deleted once the cursor is done)
                UPDATE  @tt_searchwords
                SET     operator = 'searchword_near'
                WHERE   rowid IN (@o, @oo);
            END

            FETCH NEXT FROM near_string INTO @i;
        END

        CLOSE near_string;
        DEALLOCATE near_string;

        -- delete search words that are now included in "NEAR phrases"
        DELETE  @tt_searchwords
        WHERE   operator = 'searchword_near';
    END

    It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

    Builderer of awesome

    This next bit basically takes the contents of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together.  It’s not pretty, but it works.

    -------------------------------------------------------------------------------------------------
    -- Final command string builderer of awesome
    -------------------------------------------------------------------------------------------------
    -- Need to quote any search terms that contain certain non-alphanumeric characters
    UPDATE  @tt_searchwords
    SET     word = CASE
                       WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
                       WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
                       WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
                       ELSE word
                   END
    WHERE   operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted

    -- reinitialise
    SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';

    -- get the range of row IDs that represent valid search terms and operators
    SELECT  @i  = MIN(rowid),
            @ii = MAX(rowid)
    FROM    @tt_searchwords
    WHERE   operator LIKE 'search%'
    OR      operator LIKE 'paren%';

    WHILE @i <= @ii
    BEGIN
        SELECT  @final_search_temp =
                CASE REPLACE(operator, '`', '')
                    WHEN 'op_not'  THEN ' AND NOT '
                    WHEN 'op_or'   THEN ' OR '
                    WHEN 'paren_o' THEN ' AND ('
                    WHEN 'paren_x' THEN ')'
                    ELSE ' AND '
                END
                + CASE REPLACE(operator, '`', '')
                    WHEN 'op_not'         THEN ''   -- NOT
                    WHEN 'op_or'          THEN ''   -- OR
                    WHEN 'paren_o'        THEN ''   -- (
                    WHEN 'paren_x'        THEN ''   -- )
                    WHEN 'searchnear'     THEN word -- NEAR
                    WHEN 'searchphrase'   THEN word -- "search phrase"
                    WHEN 'searchexplicit' THEN word -- explicit search
                    WHEN 'searchprefix'   THEN word -- wildcard
                    ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word) -- AND (default)
                END
        FROM    @tt_searchwords
        WHERE   rowid = @i;

        -- CONTAINSTABLE only accepts up to 4000 characters
        IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
            BREAK
        ELSE
            SET @final_search = @final_search + @final_search_temp;

        IF @i = @ii BREAK;

        SELECT  @i = MIN(rowid)
        FROM    @tt_searchwords
        WHERE   rowid > @i;
    END

    -- sort out issues of operator double-ups, etc.
    -- clunky, but it works
    SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
    SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
    SET @final_search = '(' + @final_search + ')';
    SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
    SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
    SET @final_search = REPLACE(@final_search, '( AND ', '(');
    SET @final_search = REPLACE(@final_search, ' AND )', ')');
    SET @final_search = REPLACE(@final_search, '( OR ', '(');
    SET @final_search = REPLACE(@final_search, ' OR )', ')');
    SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search) - 2);
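
    And just to show where that string ends up – a rough sketch of how the finished @final_search might be consumed (the table and key column names here are placeholders, not our actual schema):

    SELECT   d.*, ft.[RANK]
    FROM     CONTAINSTABLE(dbo.SearchableItems, (title, description), @final_search) AS ft
    JOIN     dbo.SearchableItems AS d
             ON d.item_id = ft.[KEY]
    WHERE    ft.[RANK] > 0 -- the rank filter is what makes any NEAR terms actually restrictive
    ORDER BY ft.[RANK] DESC;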

    That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

    Cheers
    DB Dave

  • Benchmarking I/O: The test rig

    Work has been a little crazy lately, but I’ve still been wanting to do some I/O performance testing since we’re due to get some new servers in soon and I wanted to be able to compare apples with oranges. So this past weekend I decided to combine something I’ve been meaning to learn more about (PowerShell) and some SQL scripts based on those described by Paul Randal (blog | twitter) in his post about testing Fusion-IO ioDrive Duo SSDs.

    Just to be clear before going any further, I’ve tested these scripts in small scale (i.e. on my laptop) and they worked well, but I’m happy to have any bugs and/or crappy code pointed out to me in the comments. 🙂 Thanks.

    So, because I tend to over-engineer things, I laid out my requirements early on to prevent any self-induced scope-creep. These were:

    • It must accept parameters for anything the user would need to change
    • It must create the test database and any other required objects
    • It should automatically re-run the test the specified number of times (to allow averaging of results)
    • It should automatically record test results without requiring me to intervene between test runs
    • If possible, it should be able to automatically change test configurations (e.g. number of database data files) and re-run the tests without intervention

    The jury’s still out on how many of these goals I achieved, but let’s dive in anyway. I’m not going to post snippets of all the code – otherwise it’ll just get too bulky.  Instead I’ll highlight the more important bits, and then upload the whole shebang at the end for you to download.

    The PowerShell script basically controls the flow of the test process, but there are 5 SQL scripts which do the actual heavy lifting. The first two scripts create the database (with the specified number and size of data files), as well as a few tables used in the test:

    -------------------------------------------------------------------------- 
    -- Create test database 
    --------------------------------------------------------------------------
    SET @cmd = N'
    	IF DB_ID(''' + @db_name + ''') IS NOT NULL DROP DATABASE [' + @db_name + ']; 
    	CREATE DATABASE [' + @db_name + '] ON PRIMARY (
    		NAME = N''' + @db_name + ''', 
    		SIZE = ' + CAST(@data_file_size_mb AS NVARCHAR) + 'MB , 
    		FILEGROWTH = 250MB, 
    		FILENAME = N''' + @db_data_path + @db_name + '.mdf'')'; 
    
    -- Add additional files 
    WHILE @i < $(NoOfDataFiles) 
    BEGIN 
    	SET @cmd = @cmd + CHAR(10) + ',(NAME = N''' + @db_name + CAST(@i AS NVARCHAR) 
    		+ ''', SIZE = ' + CAST(@data_file_size_mb AS NVARCHAR) + 'MB , FILEGROWTH = 250MB, FILENAME = N''' 
    		+ @db_data_path + @db_name + CAST(@i AS NVARCHAR) + '.ndf'')'; 
    		
    	SET @i = @i + 1; 
    END 
    
    SET @cmd = @cmd + CHAR(10) + ' LOG ON (NAME = N''' + @db_name + '_log'', SIZE = ' 
    	+ CAST(CASE WHEN $(TargetDbSizeGB) <= 1 THEN 128 WHEN $(TargetDbSizeGB) > 1 AND $(TargetDbSizeGB) <= 5 
    		THEN 512 WHEN $(TargetDbSizeGB) > 5 AND $(TargetDbSizeGB) <= 10 THEN 2048 ELSE 4096 END AS NVARCHAR) 
    		+ 'MB, FILEGROWTH = 100MB, FILENAME = N''' + @db_log_path + @db_name + '_log.ldf''); ALTER DATABASE [' 
    		+ @db_name + '] SET RECOVERY SIMPLE;'; 
    
    EXEC sp_executesql @cmd, N'';

    The main work table is a copy & paste from Paul’s post mentioned earlier, and has a clustered index on a GUID column which forces random I/O. I guess you could swap this for an integer/identity column if you wanted sequential I/O instead.
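
    For reference, the shape of that table is roughly as follows (a sketch from memory rather than Paul’s exact definition – the CHAR column is just there to pad the rows out):

    -- Random-I/O version: the clustered key is a GUID, so inserts land all over the index
    CREATE TABLE dbo.MyBigTable
    (
        c1 UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),  -- swap for INT IDENTITY(1,1) to make the inserts sequential
        c2 DATETIME         NOT NULL DEFAULT GETDATE(),
        c3 CHAR(111)        NOT NULL DEFAULT 'a',      -- filler to bulk out the row
        c4 INT              NOT NULL DEFAULT 1,
        c5 INT              NOT NULL DEFAULT 2,
        c6 BIGINT           NOT NULL DEFAULT 42,
        CONSTRAINT PK_MyBigTable PRIMARY KEY CLUSTERED (c1)
    );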

    The 3rd SQL script performs the inserts, and is copied from a related post from Thomas Kejser (blog) – I just added a bit that works out how many rows to insert based on the target database size specified.

    SELECT @counter = 0, @start = GETDATE();

    WHILE (@counter < @no_of_rows_to_insert_per)
    BEGIN
        -- batch the inserts into transactions of roughly 1,000 rows each
        IF @counter % 1000 = 0
            BEGIN TRAN;

        INSERT INTO MyBigTable DEFAULT VALUES;
        SET @counter = @counter + 1;

        IF @counter % 1000 = 999
            COMMIT TRAN;
    END;

    -- commit any batch still left open (in case the row count isn't a multiple of 1,000)
    IF @@TRANCOUNT > 0
        COMMIT TRAN;

    SET @end = GETDATE();
    INSERT INTO Results (duration) VALUES (CAST(DATEDIFF(SS, @start, @end) AS INT));

    The 4th script is directly from this post of Paul’s on wait stats – it’s definitely one worth adding to your toolbox.

    Finally, the 5th script aggregates each test run’s results into one row, so that the test rig can automatically blow everything away and run again without losing any results.

    One of the biggest challenges I faced (as the PowerShell n00b that I am) was figuring out how to asynchronously launch multiple processes and wait for them to finish.  In “batch file land” I would’ve just thrown together some “FOR %%i IN… DO START sqlcmd….” magic, but I wouldn’t allow myself to use anything but PowerShell for the flow control.  After some time spent searching around I found Start-Job which allowed me to launch a given number of instances of SQLCMD and then wait for them to all finish by calling Wait-Job (obviously!).

    ###################################################################### 
    # Kick off the specified number of insert processes ($NoOfInsertProcesses) 
    ###################################################################### 
    $sj = {
        param($SQLInstanceName, $TestDBName, $InsertRowsPath, $TargetDbSizeGB, $ApproxRowByteSize, $NoOfInsertProcesses)
        sqlcmd -S "$SQLInstanceName" -d "$TestDBName" -i "$InsertRowsPath" `
            -v TargetDbSizeGB=$TargetDbSizeGB ApproxRowByteSize=$ApproxRowByteSize NoOfInsertProcesses=$NoOfInsertProcesses
    }

    # Launch the insert jobs, then wait for them all to finish before continuing
    $(for ($x = 1; $x -le $NoOfInsertProcesses; $x++) {
        Start-Job $sj -ArgumentList $SQLInstanceName, $TestDBName, $InsertRowsPath, $TargetDbSizeGB, $ApproxRowByteSize, $NoOfInsertProcesses
    }) | Wait-Job

    Funnily enough, one of my main “Aha!” moments came while reading this post by Aaron Bertrand (blog | twitter) who isn’t your typical hard-core PowerShell or Windows Server blogger, but instead a SQL Server MVP – so either it was a sign I was on the right path, or Google knows far too much about my reading habits.

    Okay – please feel free to download the scripts, tear them apart, point and laugh, etc (although, if you point and laugh, please at least tell me why so I can learn from it). As I said before, I haven’t run them on anything apart from my laptop (which I think would’ve taken several years to perform the full-blown test), so I’d be interested to hear of your experiences.

    IOTestScripts

    In the coming weeks I’m hoping to benchmark several of our existing servers (sitting on top of various storage platforms, such as SAN – both solid state & fibre channel, local disk, NAS, and RAM disks).

    Cheers,

    DB Dave

  • Full-Text Search Operators – Part 3: Splitting the search string

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators

    We’ve now got to the point where we have a “clean” search string.  The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.

    There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons).  Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.
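
    If you don’t already have a splitter handy, here’s a minimal T-SQL sketch with the shape the code below expects – an id column that preserves word order, and a data column with each term. The name and signature are simply what the later snippets assume, so adjust to match whatever you actually use:

    CREATE FUNCTION [dbo].[fn_split_to_rows] (
        @input     NVARCHAR(4000),
        @delimiter NCHAR(1)
    )
    RETURNS @rows TABLE (id INT IDENTITY(1,1) PRIMARY KEY, data NVARCHAR(4000))
    AS
    BEGIN
        DECLARE @pos INT;
        SET @pos = CHARINDEX(@delimiter, @input);

        WHILE @pos > 0
        BEGIN
            INSERT @rows (data) VALUES (LEFT(@input, @pos - 1)); -- everything up to the next delimiter
            SET @input = SUBSTRING(@input, @pos + 1, 4000);      -- chop off what we've consumed
            SET @pos   = CHARINDEX(@delimiter, @input);
        END

        INSERT @rows (data) VALUES (@input); -- the last (or only) word
        RETURN;
    END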

    Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTEs to further refine the data.

    -------------------------------------------------------------------------------------------------
    -- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS
    -------------------------------------------------------------------------------------------------
    ;WITH
    banana_split AS
    (
        SELECT  id AS rowid,
                LTRIM(RTRIM(data)) AS word
        FROM    [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators
        WHERE   LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-)
    ),

    This then feeds into a 2nd CTE which classifies each individual search term:

    words_classified AS
    (
        -- ********************************************************************************************
        -- The order of the CASE clauses below is important in deciding the order of preference
        -- given to the different special characters. For example, if a string contains both an
        -- asterisk and double quotes (e.g. "search*term") then the operator that appears first in
        -- the CASE statement below will take effect (e.g. either "search term" or search*).
        -- ********************************************************************************************
        SELECT  rowid,
                word,
                CASE
                    WHEN word = '(' THEN 'paren_o' -- Opening parenthesis
                    WHEN word = ')' THEN 'paren_x' -- Closing parenthesis
                    WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote
                    WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote
                    WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah")
                    WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah')
                    WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah)
                    -- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators.
                    WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator
                    WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator
                    WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator
                    WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator
                    WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&)
                    WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|)
                    WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~)
                    WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-)
                    WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*)
                    ELSE 'searchword'
                END AS operator
        FROM    banana_split
    ),

    I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or your database already uses a case-sensitive collation).
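
    A quick illustration of what the binary collation buys us:

    -- Under a case-insensitive collation these would compare equal; with the binary
    -- collation only the upper-case token gets treated as an operator
    SELECT CASE WHEN ('and' COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2)
                THEN 'operator'
                ELSE 'plain search word'
           END; -- returns: plain search word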

    Now another CTE tidies up by removing some superfluous characters, etc.:

    -------------------------------------------------------------------------------------------------
    -- NOW TIDY UP THE WORDS AND OPERATORS
    -------------------------------------------------------------------------------------------------
    tidy_up_round1 AS
    (
        -- tidy up some of the words
        SELECT  rowid,
                REPLACE(
                    CASE
                        WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '')   -- remove +'s from explicit searches
                        WHEN operator = 'search_dq_o'    THEN RIGHT(word, LEN(word)-1) -- remove the opening quote
                        WHEN operator = 'search_dq_x'    THEN LEFT(word, LEN(word)-1)  -- remove the closing quote
                        ELSE word
                    END,
                    '"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid)
                operator
        FROM    words_classified
        WHERE   operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway
        AND     (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless it's a parenthesis)
    ),
    tidy_up_round2 AS
    (
        SELECT  ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
                word,
                operator
        FROM    tidy_up_round1
        WHERE   rowid BETWEEN -- exclude any operators that are incorrectly located at the beginning or end of the string
                (SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
                AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
    ),

    Next, remove consecutive & duplicate operators…

    -------------------------------------------------------------------------------------------------
    -- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID).
    -------------------------------------------------------------------------------------------------
    de_duped AS
    (
        SELECT  a.rowid,
                a.word,
                a.operator
        FROM    tidy_up_round2 a
        LEFT JOIN tidy_up_round2 b
                ON  a.rowid = (b.rowid + 1) -- consecutive rowids
                AND a.operator LIKE 'op%'   -- only operators
                AND b.operator LIKE 'op%'   -- only operators
        WHERE   b.rowid IS NULL
    ),

    …and exclude any empty or unnecessary parentheses…

    -------------------------------------------------------------------------------------------------
    -- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)")
    -------------------------------------------------------------------------------------------------
    parentheses_groups AS
    (
        SELECT  ROW_NUMBER() OVER(ORDER BY rowid) AS prowid,
                rowid,
                word,
                operator
        FROM    de_duped
        WHERE   operator IN ('paren_o', 'paren_x')
        OR      operator LIKE 'search%'
    ),
    empty_pairs AS
    (
        SELECT  a.rowid AS rowid_o,
                b.rowid AS rowid_x
        FROM    parentheses_groups a
        JOIN    parentheses_groups b
                ON  a.operator = 'paren_o' -- opening parenthesis
                AND b.operator = 'paren_x' -- closing parenthesis
                AND (
                        b.prowid = (a.prowid + 1) -- where opening and closing are consecutive rows (i.e. an empty pair)
                     OR b.prowid = (a.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary)
                    )
    ),
    -------------------------------------------------------------------------------------------------
    -- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET
    -------------------------------------------------------------------------------------------------
    no_empties AS
    (
        SELECT  a.rowid,
                a.word,
                a.operator
        FROM    de_duped a
        LEFT JOIN empty_pairs x
                ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x)
        WHERE   x.rowid_o IS NULL
    ),

    And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:

    -------------------------------------------------------------------------------------------------
    -- WILDCARD HANDLER
    -------------------------------------------------------------------------------------------------
    wildcard_handler AS
    (
        SELECT  ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
                CASE operator
                    WHEN 'searchprefix' THEN
                        '"' + CASE
                                  WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word) - 1)
                                  ELSE REPLACE(word, '*', '')
                              END + '*"'
                    ELSE word
                END AS word,
                operator
        FROM    no_empties
    )
    -------------------------------------------------------------------------------------------------
    -- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE
    -------------------------------------------------------------------------------------------------
    INSERT  @tt_searchwords (rowid, word, operator)
    SELECT  rowid, word, operator
    FROM    wildcard_handler;

    Phew. Enough for this post I think.

    So, at this point, if I pass in the following search string – sql NEAR server -(oracle | mysql) – then my table variable should be populated with the following:

    [Image: table variable contents]

    In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.

    See ya!

    DB Dave

  • Full-Text Search Operators – Part 2: Replacing alternate operators

    [Photo by natashalcd]

    In this post I’m going to deal with another piece of the puzzle that I described in part 1. Now, after re-reading that post it became obvious that I probably need to outline what the end result is meant to look like:

    We want to be able to pass in a search string, and get back a nicely formatted full-text search clause which makes full use of Boolean operators, respects parentheses, and supports wildcards. Easy, right? Yup, and we need to do it all using only T-SQL.

    Here’s the steps that make up the final function (in a nutshell):

    1. Remove some dodgy characters (vertical-tab, form-feed, etc)
    2. Tidy up any parentheses [part 1]
    3. Replace any alternate operators with place-holders (most operators have 2 representations; e.g. OR and |, AND and &, NOT and -, etc.).
    4. Break the search term into individual words (i.e. split the string on space characters)
    5. Using this list of individual words, do the following:
      1. Classify each word as an operator, search term, parenthesis, wildcard, etc.
      2. Tidy up the list, removing illegal and unnecessary characters
      3. Remove invalid operators (e.g. “ipod NOT” is not a valid search term)
      4. Remove unnecessary parentheses (e.g. single words wrapped in parentheses)
      5. Handle wildcards. Since full-text only accepts prefix searches, change all invalid wildcard searches into prefix searches.
    6. Check for single and double-quotes (& tidy up any invalid quote-marks, make sure only valid pairs remain, etc.).
    7. Process any NEAR(~) operators.
    8. Put the final Full-Text command string together.

    Right, so we’ve already taken care of number 2 in the previous post, so now on to number 3 – replacing the alternate operators.  This step may not be necessary for you, since its sole purpose is to preserve the original search string for our reporting purposes.

    You see, if I search for “ipad OR iphone NOT ipod” and you search for “ipad | iphone -ipod”, then obviously since they’re really the same search we will get the same results (and our cache should also cache them as 1 search), but for reporting purposes we want to know that our searches were typed in differently.  The way we do that in our implementation is to return a normalised search string, used to perform & cache the search, and the original string, used for reporting.

    Here’s the very straightforward code:

    IF OBJECT_ID('[dbo].[fn_search_clause_operator_replacer]') IS NULL
    EXEC('CREATE FUNCTION [dbo].[fn_search_clause_operator_replacer] () RETURNS INT AS BEGIN RETURN(0); END ');
    GO
    /***************************************************************************************************************************
    This is a supporting function, which is called by the function [fn_search_clause_get].
    It accepts a string input, and outputs the same string, but with any character operators replaced with
    placeholder text-operators instead.
    Author : David Curlewis
    Date : 07/2009
    
    This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
    http://creativecommons.org/licenses/by-nc/3.0/
    ****************************************************************************************************************************/
    ALTER FUNCTION [dbo].[fn_search_clause_operator_replacer] (
        @search_text NVARCHAR(4000)
    ) RETURNS NVARCHAR(4000)
    AS
    BEGIN
        DECLARE @not_pattern VARCHAR(50),
                @not_pos     INT;

        SET @search_text = REPLACE(REPLACE(REPLACE(@search_text,
                               '|', ' `OR` '),   -- OR
                               '~', ' `NEAR` '), -- NEAR
                               '&', ' `AND` ');  -- AND

        -- the NOT operator requires some extra work because we only consider it an operator if it is not hyphenating a word
        SET @not_pattern = '%[^a-z0-9][-][ a-z0-9"''(]%'; -- valid NOT operator pattern
        SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get the position of the first valid "-" operator

        WHILE (@not_pos > 0)
        BEGIN
            SET @search_text = STUFF(@search_text, @not_pos, 2, ' `NOT` ');
            SET @not_pos = ISNULL((SELECT PATINDEX(@not_pattern, @search_text)), 0); -- get pos of the next operator (if any)
        END

        RETURN(@search_text);
    END
    GO

    The first bit is very simple; it’s just a REPLACE statement replacing each operator’s “symbol operator” (i.e. |, ~, &) with a bit of placeholder text (i.e. `OR`, `NEAR`, `AND`).

    The NOT operators require a little more work because we don’t want to interpret all hyphens as operators, otherwise you wouldn’t be able to search for hyphenated words. Using PATINDEX we can very easily match on a simple RegEx-like pattern: [^a-z0-9][-][ a-z0-9"'(]

    This looks for any occurrences of the hyphen, where it follows a non-alphanumeric character, and precedes a space, an alphanumeric character, a double or single quote, or an opening parenthesis. It then replaces each occurrence with the placeholder text `NOT`.
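
    To make that concrete, here’s roughly what the function spits out for a typical search string (result shown as a comment; the exact spacing may differ):

    SELECT [dbo].[fn_search_clause_operator_replacer](N'ipad | iphone -ipod');
    -- returns something like: ipad  `OR`  iphone `NOT` ipod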

    In the next post I’ll go through the meatiest bit, which involves splitting the string (number 4), and preparing the resulting data (number 5).

    Cheers

    DB Dave

  • Removing accented and illegal characters from a string

    This post isn’t related to the series I’m currently writing on Full-Text search operators, but it is in the same ball-park so I thought I’d throw it up here.

    Today I was working on our search term auto-suggest system (i.e. the system that shows a list of suggestions in a drop-down list below the search box as you type your search term in), and needed to improve the way we strip out “illegal” characters from our dictionary terms.

    Illegal characters in our case refers to pretty much anything that isn’t alphanumeric, or a space, dollar sign, ampersand, hyphen, or period.  We don’t even want any accented characters (i.e. letters with diacritics) – although in this case we want to replace accented characters with their un-accented counterparts (e.g. Å should be replaced by A).

    The way I did this (and I’m sure there are many, many better ways) is to insert illegal characters into a table variable, along with a replacement character, and then update the input variable for each row in this table variable – replacing any matching illegal characters each time with its replacement value.

    IF OBJECT_ID('[dbo].[fn_clean_search_term]') IS NULL
    	EXEC('CREATE FUNCTION [dbo].[fn_clean_search_term] () RETURNS INT AS BEGIN RETURN(0); END ');
    GO
    /***************************************************************************************************************************
    This accepts a string input, and outputs the same string, but with any invalid characters removed, or replaced if they
    are accented characters. Note: not all accented characters are accounted for, only the most common ones in our data.
    Also, this does not handle Unicode characters.
    
    Author : David Curlewis
    Date : 07/2009
    
    This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
    http://creativecommons.org/licenses/by-nc/3.0/
    ****************************************************************************************************************************/
    ALTER FUNCTION [dbo].[fn_clean_search_term] (
    	@keyword VARCHAR(2048)
    ) RETURNS VARCHAR(2048) WITH SCHEMABINDING
    AS
    BEGIN
    
    	-- Only do the heavy lifting if there are non-alphanumeric characters in the string
    	IF PATINDEX('%[^A-Za-z0-9 ]%', @keyword) > 0
    	BEGIN
    
    		-- if the string contains anything other than "valid" characters, strip them out
    		DECLARE @chars TABLE ([ascii] TINYINT NOT NULL, [new_char] VARCHAR(1));
    
    		;WITH illegal_chars AS
    		(
    			SELECT	0 AS [ascii]
    			UNION ALL
    			SELECT	[ascii] + 1
    			FROM	illegal_chars
    			WHERE	[ascii] < 255
    		)
    		INSERT	@chars ([ascii], [new_char])
    		SELECT	[ascii],
    				CASE -- Replace accented letters with non-accented letters to normalise the search terms
    				-- Unless your collation is case sensitive, the lower & upper case variants are equivalent
    				WHEN CHAR([ascii]) IN ('á','à','â','ä','å') THEN 'a'
    				WHEN CHAR([ascii]) IN ('Á','Ã','Ä','Å','À','Â') THEN 'A'
    				WHEN CHAR([ascii]) IN ('ç') THEN 'c'
    				WHEN CHAR([ascii]) IN ('Ç') THEN 'C'
    				WHEN CHAR([ascii]) IN ('ê','é','ë','è') THEN 'e'
    				WHEN CHAR([ascii]) IN ('Ê','Ë','É','È') THEN 'E'
    				WHEN CHAR([ascii]) IN ('ï','í','î','ì') THEN 'i'
    				WHEN CHAR([ascii]) IN ('Í','Ì','Î','Ï') THEN 'I'
    				WHEN CHAR([ascii]) IN ('ñ') THEN 'n'
    				WHEN CHAR([ascii]) IN ('Ñ') THEN 'N'
    				WHEN CHAR([ascii]) IN ('ô','ö','ò','õ','ó','ø') THEN 'o'
    				WHEN CHAR([ascii]) IN ('Ó','Ô','Õ','Ø','Ö','Ò') THEN 'O'
    				WHEN CHAR([ascii]) IN ('š') THEN 's'
    				WHEN CHAR([ascii]) IN ('Š') THEN 'S'
    				WHEN CHAR([ascii]) IN ('ú','ü','û','ù') THEN 'u'
    				WHEN CHAR([ascii]) IN ('Ù','Ú','Ü','Û') THEN 'U'
    				WHEN CHAR([ascii]) IN ('ÿ','ý') THEN 'y'
    				WHEN CHAR([ascii]) IN ('Ÿ','Ý') THEN 'Y'
    				WHEN CHAR([ascii]) IN ('ž') THEN 'z'
    				WHEN CHAR([ascii]) IN ('Ž') THEN 'Z'
    				ELSE '' -- remove the character
    				END AS [new_char]
    		FROM	illegal_chars
    		WHERE	[ascii] NOT BETWEEN 48 AND 57 -- 0 -> 9
    		AND		[ascii] NOT BETWEEN 65 AND 90 -- A -> Z
    		AND		[ascii] NOT BETWEEN 97 AND 122 -- a -> z
    		AND		[ascii] NOT IN ( -- explicitly allow the following characters
    					32, -- <space>
    					36, -- $
    					38, -- &
    					45, -- -
    					46  -- .
    					)
    		OPTION (MAXRECURSION 255);
    
    		UPDATE @chars
    		SET @keyword = REPLACE(@keyword, CHAR([ascii]), [new_char])
    
    	END
    
    	-- remove multiple spaces (replace any double space with a single space until none remain)
    	WHILE CHARINDEX('  ', @keyword) > 0
    		SELECT @keyword = REPLACE(@keyword, '  ', ' ');
    
    	SELECT @keyword = LTRIM(RTRIM(@keyword)); -- trim leading & trailing whitespace
    	SELECT @keyword = LOWER(@keyword); -- return lower-case only
    
    	RETURN @keyword;
    END
    GO
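
    A quick example of it in action (exactly which accented characters get mapped will depend on your collation/code page, so treat the output as illustrative):

    SELECT [dbo].[fn_clean_search_term]('Crème Brûlée & $10 ramekins!');
    -- returns something like: creme brulee & $10 ramekins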

    Let me know if you have any questions, or different/better ways of doing this (yes, I know I could do it using a couple of lines of .Net in a CLR function, but I don’t have CLR enabled on this server).

    Cheers

    DB Dave