Author: DB Dave

  • Snow snow snow!

    I’ve been wanting to try posting from the WordPress iPhone app, and here’s my chance! Snow! (well, some snow, and lots of slushy stuff)

    Snow in itself isn’t that exciting, but snow in Wellington IS unusual. My wife is a local and can’t remember there ever being full-on snow like this:

    [Photos: 20110815-170553.jpg and 20110815-170611.jpg]

    Yes, it was quite interesting riding home on frozen roads – but I made it!! (obviously)

    It’s pretty cool how a bit of snow makes the whole neighborhood start acting like kids. There are adults chasing each other around having snowball fights, and the kids are all just loving it!

    Well, time for me to get back to the task of warming up…

    Cheers
    DB Dave

  • T-SQL Tuesday #21 – Temporarily permanent hacks

    This month’s T-SQL Tuesday Wednesday is hosted by Adam Machanic (blog|twitter) – who in fact started the T-SQL Tuesday concept himself back in 2009 (read more about the genesis here).

    This month it’s about “crap code”.
    We’ve all seen it, and we’ve all written our fair share of it.  I’m not going to concentrate too much on actual crappy code – like using deprecated syntax or functions, or lots of GOTO logic within cursors (always bags of fun) – but will rather look at some of the other steaming piles of “legacy” we leave behind us in the course of our busy DBA lives. None of these is necessarily going to take your production system down, or cause World War 3, but they’re bloody annoying at least.

    Some of my favourite examples of such landmines include:

    1. “Temporary” logging tables
    2. Renamed or “backup” tables
    3. Hard-coded values in stored procedures
    4. Scheduled job sprawl (or bloat)

    “I’ll just add some logging while I troubleshoot this bug.”

    Take one stored procedure, add a handful of intermittent business rule bugs, and you’ve got the perfect excuse to add a new logging table, and some logging code to the procedure. Something similar to this perhaps:

    create table dc_logging_tmp (dt datetime default getdate(), m_id int, bal money);
    
    IF @member_id IN (6763, 3884, 8734) AND @ledger_balance < 0
        INSERT dc_logging_tmp (m_id, bal) VALUES (@member_id, @ledger_balance);

    Of course the table name is generic, there are no indexes, and the logging code has no comments explaining its existence – but you’ll only need it for a few hours, maybe a day or 2 at most, and then you’ll rip it all out again. Right?

    If you really only need something short-term, and there’s a chance you won’t remember to tidy up after yourself (and it won’t break anything if the table doesn’t exist), then consider creating your logging objects in TempDB.  Next time you restart your SQL service: Poof!! All gone.
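
    For what it’s worth, here’s a minimal sketch of that TempDB approach, re-using the made-up logging table from the example above:

    USE tempdb;
    GO
    
    -- Same throwaway logging table, but created in TempDB so it disappears
    -- by itself the next time the SQL service restarts
    CREATE TABLE dbo.dc_logging_tmp (dt DATETIME DEFAULT GETDATE(), m_id INT, bal MONEY);
    GO
    
    -- The logging code inside the procedure then uses the three-part name,
    -- so it works regardless of the current database context
    IF @member_id IN (6763, 3884, 8734) AND @ledger_balance < 0
        INSERT tempdb.dbo.dc_logging_tmp (m_id, bal) VALUES (@member_id, @ledger_balance);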

    “Just to be safe I’ll create a copy of this table before we deploy this application update.”

    You’re about to do a deploy to production which involves making some data changes. So, being the sensible, risk-averse DBA that you are, you decide to take a snapshot of the current table and do a simple “SELECT * INTO” before the update.

    Of course you’ll remember to drop this temporary backup table tomorrow once you’re sure you won’t need to roll back the update, so calling it [zzz_table_name] seems reasonable. Right?  Aaah, no. There’s nothing stopping you from creating tables with names like [ledger_pre_Jul2011_v3_deploy_drop_after_3_weeks].
    Okay, that might be extreme, but you get the idea.  If you feel like being especially thorough you could even make use of extended properties to add comments, descriptions, or whatever else you like.

    CREATE TABLE TableWithDescription (col INT);
    GO
    
    EXEC sp_addextendedproperty
        @name = N'Description',
        @value = 'This table is for XYZ, blah blah blah, etc...',
        @level0type = N'Schema', @level0name = 'dbo',
        @level1type = N'Table', @level1name = 'TableWithDescription';
    GO
    
    SELECT name, value
    FROM sys.fn_listextendedproperty(
        'Description', 'Schema', 'dbo', 'Table',
        'TableWithDescription', NULL, NULL
    );
    GO

    Here’s a little more information on extended properties from Glenn Berry (twitter|blog).

    “It’ll just be easier to hard-code these hourly rate values in this procedure.”

    The thing about writing maintainable code is that it should be easy to maintain (the clue is in the name, you see). You shouldn’t need to alter a dozen stored procedures if your company’s rates change – things like that should be stored in a lookup table, or (an option I quite like) in table-valued functions or views. You should obviously still test and choose what’s best suited to your particular situation.

    The main advantage of using a TVF or view, in my opinion, is that you can make use of source control and versioning (which is more difficult to do with data stored in a table). This article by Joe Celko gave me a few new ideas in this regard.
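
    To illustrate (none of the object or rate names here come from a real system – they’re purely hypothetical), an inline table-valued function standing in for hard-coded rates might look something like this:

    -- Hypothetical example: the rates live in one versionable object
    -- instead of being scattered through a dozen stored procedures
    CREATE FUNCTION dbo.fn_hourly_rates ()
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT rate_code, hourly_rate
        FROM (VALUES ('standard', 85.00),
                     ('overtime', 127.50),
                     ('weekend', 170.00)) AS r (rate_code, hourly_rate)
    );
    GO
    
    -- Procedures then join to the function rather than embedding the numbers
    SELECT t.hours_worked * r.hourly_rate AS amount
    FROM dbo.timesheet AS t
    JOIN dbo.fn_hourly_rates() AS r
        ON r.rate_code = t.rate_code;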

    Unfortunately I still find plenty of examples of developers including static values in stored procedures. Remember, kids: every time you get lazy and hard-code something that you shouldn’t, a kitten loses its wings.

    “I need this stored proc to run every day. Could you just quickly create a scheduled job for me?”

    How many scheduled jobs do you have? Do you know exactly what every single one of them is doing? If you do, then I’m jealous.

    Environments that have been around for a while tend to suffer from scheduled job build-up.  This can rear its head in one of two ways: either you end up with a million separate jobs (i.e. job sprawl), or a few jobs, each with a hundred steps (i.e. job bloat).

    My advice in combating this is to decide how you’re going to group your jobs (e.g. by schedule, function, target database, etc.), and then stick to it.  Also remember to religiously maintain your job names and descriptions, schedule names, and step names – this makes your jobs essentially self-documenting (you could even write a few simple queries to generate some documentation for you!).
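
    As a rough starting point (this only scratches the surface – extend it to schedules, owners, and so on as needed), a query against the standard msdb job tables can spit out a basic job inventory:

    -- Generate rough documentation of SQL Agent jobs and their steps
    SELECT j.name AS job_name,
           j.description AS job_description,
           j.enabled AS is_enabled,
           s.step_id,
           s.step_name,
           s.subsystem,
           s.database_name
    FROM msdb.dbo.sysjobs AS j
    JOIN msdb.dbo.sysjobsteps AS s
        ON s.job_id = j.job_id
    ORDER BY j.name, s.step_id;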

    Well, that’s about enough from me on the topic of crap. I’d love to hear your feedback on any of the examples I’ve covered here, and feel free to tell me about your own crap (or the crap you’ve had to deal with).

    Cheers
    DB Dave

  • Multiple CTE gotcha

    This is probably one of those things you’ll read and think “What a dumbass, I’d never make that mistake!” – which is fine, but I thought I’d post about it nevertheless since it managed to trip me up twice in the space of a week. :)

    I’ve got a script which grabs data from several large tables (tens of millions of rows each), and aggregates and sanitises the data in 5 or 6 distinct steps.  I used to do this using several temp tables in the “bad old days”, but now use easier-to-read (to me, anyway) Common Table Expressions – or CTEs.

    This post isn’t going to tell you how to use CTEs – there’s a pile of great stuff out there waiting for you to dig it up using your search engine of choice.

    This post will, however, demonstrate how 3 CTEs managed to first trick me into thinking I was a performance tuning god, and then confound me with an “impossible to fix” bug.

    Consider the following pseudo code:

    ;with cte_1 as
    (
        select blah,
               COUNT(*),
               AVG(value)
        from a_very_big_table
        group by blah
        having AVG(value) >= 10
    ),
    cte_2 as
    (
        select some_scalar_function(blah),
               some_value,
               a_value
        from cte_1
        where some_clause
    ),
    cte_3 as
    (
        select some_values...
        from cte_1
        where some_criteria
        group by some_more_grouping
    )
    insert somewhere
    select some_stuff
    from cte_3

    I wrote something similar to this to grab some data from a large table, whittle it down with some WHERE clauses in the 1st CTE, then run some gnarly functions and more WHERE clauses in the 2nd CTE, and finally some more aggregations in a 3rd CTE before finally inserting the results into a staging table. So each CTE should reference the CTE above it.

    "Face Palm" by tarabrown
    “Doh!”

    As you can see above though, CTE_3 is referencing CTE_1 instead of CTE_2.  This basically means that SQL ignores CTE_2 (since its output isn’t used anywhere).  This explains why I thought I was a performance tuning god – I made a few changes to the query, and the run-time went from 6 minutes to 12 seconds! I also soon found that the data didn’t look right, and it took 45 minutes of pointless poking around before I had that forehead-smacking realisation that a typo was the cause of my pain (and my runtime was back up to 6 minutes, awww man!).

    Yes, this could be (and was) done in a single statement, but I broke it up into separate CTEs for readability.

    Cheers
    DB Dave

  • Full-Text Search Operators – Part 4: Wrapping it up

    "iPod nano 4Gb wrapping" by Darkmere
    That’s a wrap!

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string

    Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two.  I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

    Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T-SQL-based, Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

    NEAR enough…

    “NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering on the full-text rank value returned by CONTAINSTABLE (the CONTAINS predicate doesn’t return a rank, so there’s nothing to filter on there).

    In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them.  Where NEAR differs from AND is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
    So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.
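
    As a hedged example (the table, column, and key names below are placeholders rather than anything from a real schema, and it assumes a full-text index already exists on the searched column), the query pattern looks something like this:

    -- Only return rows where "dog" and "cat" are actually near each other,
    -- i.e. where the proximity-based rank is greater than zero
    SELECT d.doc_id, d.title, ct.[RANK]
    FROM dbo.documents AS d
    JOIN CONTAINSTABLE(dbo.documents, body, '"dog" NEAR "cat"') AS ct
        ON ct.[KEY] = d.doc_id
    WHERE ct.[RANK] > 0
    ORDER BY ct.[RANK] DESC;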

    -------------------------------------------------------------------------------------------------
    -- "NEAR" OPERATOR HANDLER
    -------------------------------------------------------------------------------------------------
    IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
    BEGIN
    -- reinitialise
    SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;
    -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
    DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
    SELECT rowid
    FROM @tt_searchwords
    WHERE REPLACE(operator, '`', '') = 'op_near';
    
    OPEN near_string;
    FETCH NEXT FROM near_string INTO @i;
    WHILE @@FETCH_STATUS = 0
    BEGIN
    -- get the rowid of the value before and after the NEAR operator row
    SELECT @o = MAX(rowid)
    FROM @tt_searchwords
    WHERE rowid < @i
    AND operator NOT LIKE 'paren%';
    
    SELECT @oo = MIN(rowid)
    FROM @tt_searchwords
    WHERE rowid > @i
    AND operator NOT LIKE 'paren%';
    
    -- if the NEAR-operator (~) is the first or last word in the search string, it's not a valid option
    IF (@o IS NULL) OR (@oo IS NULL)
    DELETE @tt_searchwords
    WHERE rowid = @i;
    ELSE BEGIN
    -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
    SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
    FROM @tt_searchwords
    WHERE rowid = @o;
    
    SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
    FROM @tt_searchwords
    WHERE rowid = @oo;
    
    -- update the NEAR operator record with the concatenated search term
    UPDATE @tt_searchwords
    SET word = @current_word,
    operator = 'searchnear'
    WHERE rowid = @i;
    
    -- flag the records before and after the NEAR operator (they get deleted once the cursor loop has finished)
    UPDATE @tt_searchwords
    SET operator = 'searchword_near'
    WHERE rowid IN (@o, @oo);
    END
    
    FETCH NEXT FROM near_string INTO @i;
    END
    
    CLOSE near_string;
    DEALLOCATE near_string;
    
    -- delete search words that are now included IN "NEAR phrases"
    DELETE @tt_searchwords
    WHERE operator = 'searchword_near';
    END

    It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

    Builderer of awesome

    This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together.  It’s not pretty, but it works.

    -------------------------------------------------------------------------------------------------
    -- Final command string builderer of awesome
    -------------------------------------------------------------------------------------------------
    -- Need to quote any search terms that contain certain non-alphanumeric characters
    UPDATE @tt_searchwords
    SET word = CASE
    WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
    WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
    WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
    ELSE word END
    WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted
    -- reinitialise
    SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';
    
    -- get the range of row IDs that represent valid search terms and operators
    SELECT @i = MIN(rowid),
    @ii = MAX(rowid)
    FROM @tt_searchwords
    WHERE operator LIKE 'search%'
    OR operator LIKE 'paren%';
    
    WHILE @i <= @ii
    BEGIN
    SELECT @final_search_temp =
    CASE REPLACE(operator, '`', '')
    WHEN 'op_not' THEN ' AND NOT '
    WHEN 'op_or' THEN ' OR '
    WHEN 'paren_o' THEN ' AND ('
    WHEN 'paren_x' THEN ')'
    ELSE ' AND '
    END
    + CASE REPLACE(operator, '`', '')
    WHEN 'op_not' THEN '' -- NOT
    WHEN 'op_or' THEN '' -- OR
    WHEN 'paren_o' THEN '' -- (
    WHEN 'paren_x' THEN '' -- )
    WHEN 'searchnear' THEN word -- NEAR
    WHEN 'searchphrase' THEN word -- "search phrase"
    WHEN 'searchexplicit' THEN word -- explicit search
    WHEN 'searchprefix' THEN word -- wildcard
    ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word) -- AND (default)
    END
    FROM @tt_searchwords
    WHERE rowid = @i;
    
    -- CONTAINSTABLE only accepts up to 4000 characters
    IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
    BREAK
    ELSE
    SET @final_search = @final_search + @final_search_temp;
    
    IF @i = @ii BREAK;
    
    SELECT @i = MIN(rowid)
    FROM @tt_searchwords
    WHERE rowid > @i;
    END
    
    -- sort out issues of operator double-ups, etc
    -- clunky but it works
    SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
    SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
    SET @final_search = '(' + @final_search + ')';
    SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
    SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
    SET @final_search = REPLACE(@final_search, '( AND ', '(');
    SET @final_search = REPLACE(@final_search, ' AND )', ')');
    SET @final_search = REPLACE(@final_search, '( OR ', '(');
    SET @final_search = REPLACE(@final_search, ' OR )', ')');
    SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);

    That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

    Cheers
    DB Dave

  • Benchmarking I/O: The test rig

    Work has been a little crazy lately, but I’ve still been wanting to do some I/O performance testing since we’re due to get some new servers in soon and I wanted to be able to compare apples with oranges. So this past weekend I decided to combine something I’ve been meaning to learn more about (PowerShell) and some SQL scripts based on those described by Paul Randal (blog | twitter) in his post about testing Fusion-IO ioDrive Duo SSD’s.

    Just to be clear before going any further, I’ve tested these scripts in small scale (i.e. on my laptop) and they worked well, but I’m happy to have any bugs and/or crappy code pointed out to me in the comments. 🙂 Thanks.

    So, because I tend to over-engineer things, I laid out my requirements early on to prevent any self-induced scope-creep. These were:

    • It must accept parameters for anything the user would need to change
    • It must create the test database and any other required objects
    • It should automatically re-run the test the specified number of times (to allow averaging of results)
    • It should automatically record test results without requiring me to intervene between test runs
    • If possible, it should be able to automatically change test configurations (e.g. number of database data files) and re-run the tests without intervention

    The jury’s still out on how many of these goals I achieved, but let’s dive in anyway. I’m not going to post snippets of all the code – otherwise it’ll just get too bulky.  Instead I’ll highlight the more important bits, and then upload the whole shebang at the end for you to download.

    The PowerShell script basically controls the flow of the test process, but there are 5 SQL scripts which do the actual heavy lifting. The first 2 files create the database (with the specified number of data files, and of the specified size), as well as a few tables used in the test:

    -------------------------------------------------------------------------- 
    -- Create test database 
    --------------------------------------------------------------------------
    SET @cmd = N'
    	IF DB_ID(''' + @db_name + ''') IS NOT NULL DROP DATABASE [' + @db_name + ']; 
    	CREATE DATABASE [' + @db_name + '] ON PRIMARY (
    		NAME = N''' + @db_name + ''', 
    		SIZE = ' + CAST(@data_file_size_mb AS NVARCHAR) + 'MB , 
    		FILEGROWTH = 250MB, 
    		FILENAME = N''' + @db_data_path + @db_name + '.mdf'')'; 
    
    -- Add additional files 
    WHILE @i < $(NoOfDataFiles) 
    BEGIN 
    	SET @cmd = @cmd + CHAR(10) + ',(NAME = N''' + @db_name + CAST(@i AS NVARCHAR) 
    		+ ''', SIZE = ' + CAST(@data_file_size_mb AS NVARCHAR) + 'MB , FILEGROWTH = 250MB, FILENAME = N''' 
    		+ @db_data_path + @db_name + CAST(@i AS NVARCHAR) + '.ndf'')'; 
    		
    	SET @i = @i + 1; 
    END 
    
    SET @cmd = @cmd + CHAR(10) + ' LOG ON (NAME = N''' + @db_name + '_log'', SIZE = ' 
    	+ CAST(CASE WHEN $(TargetDbSizeGB) <= 1 THEN 128 WHEN $(TargetDbSizeGB) > 1 AND $(TargetDbSizeGB) <= 5 
    		THEN 512 WHEN $(TargetDbSizeGB) > 5 AND $(TargetDbSizeGB) <= 10 THEN 2048 ELSE 4096 END AS NVARCHAR) 
    		+ 'MB, FILEGROWTH = 100MB, FILENAME = N''' + @db_log_path + @db_name + '_log.ldf''); ALTER DATABASE [' 
    		+ @db_name + '] SET RECOVERY SIMPLE;'; 
    
    EXEC sp_executesql @cmd, N'';

    The main work table is a copy & paste from Paul’s post mentioned earlier, and has a clustered index on a GUID column which forces random I/O. I guess you could swap this for an integer/identity column if you wanted to test sequential I/O instead.
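
    I haven’t reproduced Paul’s exact table definition here, but the general shape of such a work table (treat this as an approximation, not the real thing) is along these lines:

    -- Approximate shape of the work table: rows keyed on a GUID, with a
    -- clustered index on that GUID, scatter inserts across the index and
    -- so generate random I/O. Every column has a default, so the insert
    -- script below can simply use DEFAULT VALUES.
    CREATE TABLE dbo.MyBigTable
    (
        id      UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
        created DATETIME NOT NULL DEFAULT GETDATE(),
        filler  CHAR(400) NOT NULL DEFAULT 'x'  -- padding to bulk out each row
    );
    GO
    CREATE CLUSTERED INDEX cix_MyBigTable ON dbo.MyBigTable (id);
    GO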

    The 3rd SQL script performs the inserts, and is copied from a related post from Thomas Kejser (blog) – I just added a bit that works out how many rows to insert based on the target database size specified.
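
    That size-based calculation isn’t part of the snippet below, but the idea is simple enough – here’s a rough sketch, using the same SQLCMD variables the PowerShell script passes in (the variable name is illustrative):

    -- Rough sketch: derive a per-process row count from the target database
    -- size, the approximate row size, and the number of concurrent processes
    DECLARE @no_of_rows_to_insert_per BIGINT;
    
    SELECT @no_of_rows_to_insert_per =
        (CAST($(TargetDbSizeGB) AS BIGINT) * 1024 * 1024 * 1024)  -- target size in bytes
        / $(ApproxRowByteSize)                                    -- divided by bytes per row
        / $(NoOfInsertProcesses);                                 -- split across the insert processes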

    SELECT @counter = 0, @start = GETDATE();
    
    WHILE (@counter < @no_of_rows_to_insert_per)
    BEGIN
        -- open a transaction roughly every 1,000 rows, and commit it again just
        -- before the next one opens, so the inserts are batched rather than
        -- each running in its own auto-commit transaction
        IF @counter % 1000 = 0
            BEGIN TRAN;
    
        INSERT INTO MyBigTable DEFAULT VALUES;
        SET @counter = @counter + 1;
    
        IF @counter % 1000 = 999
            COMMIT TRAN;
    END;
    
    SET @end = GETDATE();
    INSERT INTO Results (duration) VALUES (CAST(DATEDIFF(SS, @start, @end) AS INT));

    The 4th script is directly from this post of Paul’s on wait stats – it’s definitely one worth adding to your toolbox.

    Finally, the 5th script aggregates each test run’s results into one row, so that the test rig can automatically blow everything away and run again without losing any results.

    One of the biggest challenges I faced (as the PowerShell n00b that I am) was figuring out how to asynchronously launch multiple processes and wait for them to finish.  In “batch file land” I would’ve just thrown together some “FOR %%i IN… DO START sqlcmd….” magic, but I wouldn’t allow myself to use anything but PowerShell for the flow control.  After some time spent searching around I found Start-Job, which allowed me to launch a given number of instances of SQLCMD and then wait for them all to finish by calling Wait-Job (obviously!). ;)

    ###################################################################### 
    # Kick off the specified number of insert processes ($NoOfInsertProcesses) 
    ###################################################################### 
    $sj = { param($SQLInstanceName, $TestDBName, $InsertRowsPath, $TargetDbSizeGB, $ApproxRowByteSize, $NoOfInsertProcesses)
        sqlcmd -S "$SQLInstanceName" -d "$TestDBName" -i "$InsertRowsPath" -v TargetDbSizeGB=$TargetDbSizeGB ApproxRowByteSize=$ApproxRowByteSize NoOfInsertProcesses=$NoOfInsertProcesses
    }
    
    # Launch one background job per insert process, then wait for them all to finish before continuing
    $(for ($x=1; $x -le $NoOfInsertProcesses; $x++){
        Start-Job $sj -ArgumentList $SQLInstanceName, $TestDBName, $InsertRowsPath, $TargetDbSizeGB, $ApproxRowByteSize, $NoOfInsertProcesses
    }) | Wait-Job

    Funnily enough, one of my main “Aha!” moments was had while reading this post by Aaron Bertrand (blog | twitter) who isn’t your typical hard-core PowerShell or Windows Server blogger, but instead a SQL Server MVP – so either it was a sign I was on the right path, or Google knows far too much about my reading habits. :)

    Okay – please feel free to download the scripts, tear them apart, point and laugh, etc (although, if you point and laugh, please at least tell me why so I can learn from it). As I said before, I haven’t run them on anything apart from my laptop (which I think would’ve taken several years to perform the full-blown test), so I’d be interested to hear of your experiences.

    IOTestScripts

    In the coming weeks I’m hoping to benchmark several of our existing servers (sitting on top of various storage platforms, such as SAN – both solid state & fibre channel, local disk, NAS, and RAM disks).

    Cheers,

    DB Dave

  • Full-Text Search Operators – Part 3: Splitting the search string

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators

    We’ve now got to the point where we have a “clean” search string.  The next step is to split the string into rows which we can then manipulate without having to traverse the string itself.

    There are many “split to row” table-valued functions out there, and I’d recommend using a CLR-based one if you can (for performance reasons).  Basically the function accepts an input string and a delimiter, and outputs a table of individual search terms.
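
    If you don’t have one handy, here’s a simple (and admittedly slow) T-SQL version matching the shape the code below expects, i.e. an “id” ordinal column and a “data” value column – just a sketch, so test it before trusting it:

    -- A basic loop-based splitter; fine for short search strings, but a CLR
    -- version will be faster if you're doing this at any volume.
    -- Note that it does nothing to the case of each term, which matters if
    -- you rely on UPPER-CASE operators later on.
    CREATE FUNCTION dbo.fn_split_to_rows (@input NVARCHAR(4000), @delimiter NCHAR(1))
    RETURNS @out TABLE (id INT IDENTITY(1,1), data NVARCHAR(4000))
    AS
    BEGIN
        DECLARE @pos INT = CHARINDEX(@delimiter, @input);
    
        WHILE @pos > 0
        BEGIN
            INSERT @out (data) VALUES (LEFT(@input, @pos - 1));
            SET @input = SUBSTRING(@input, @pos + 1, LEN(@input));
            SET @pos = CHARINDEX(@delimiter, @input);
        END
    
        INSERT @out (data) VALUES (@input); -- the last (or only) term
        RETURN;
    END
    GO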

    Instead of simply inserting these rows into a temporary table or table variable at this stage, I first use a series of CTEs to further refine the data.

    -------------------------------------------------------------------------------------------------
    -- SPLIT SEARCH STRING INTO INDIVIDUAL SEARCH WORDS
    -------------------------------------------------------------------------------------------------
    ;WITH
    banana_split AS
    (
    SELECT id AS rowid,
    LTRIM(RTRIM(data)) as word
    FROM [dbo].[fn_split_to_rows](@search_text, ' ') -- whatever split method you use, make sure it retains the case if you rely on UPPER-CASE operators
    WHERE LEN(LTRIM(RTRIM(data))) BETWEEN 1 AND 100 -- any single "word" longer than 100 chars is likely rubbish (http://en.wikipedia.org/wiki/Longest_word_in_English) ;-)
    ),

    This then feeds into a 2nd CTE which classifies each individual search term:

    words_classified AS
    (
    -- ********************************************************************************************
    -- The order of the CASE clauses below is important in deciding the order of preference
    -- given to the different special characters. For example, if a string contains both an
    -- asterisk and double quotes (e.g. "search*term") then the operator that appears first in
    -- the CASE statement below will take effect (e.g. either "search term" or search*).
    -- ********************************************************************************************
    SELECT rowid,
    word,
    CASE
    WHEN word = '(' THEN 'paren_o' -- Opening parenthesis
    WHEN word = ')' THEN 'paren_x' -- Closing parenthesis
    WHEN LEFT(word,1) = '"' OR LEFT(word,1) = '''' THEN 'search_dq_o' -- An opening exact phrase-quote
    WHEN RIGHT(word,1) = '"' OR RIGHT(word,1) = '''' THEN 'search_dq_x' -- A closing exact phrase-quote
    WHEN LEFT(word,1) = '"' AND RIGHT(word,1) = '"' THEN 'searchexplicit' -- An exact term search (i.e. "blah")
    WHEN LEFT(word,1) = '''' AND RIGHT(word,1) = '''' THEN 'searchexplicit' -- An exact term search (i.e. 'blah')
    WHEN LEFT(word,1) = '+' THEN 'searchexplicit' -- An exact term search (i.e. +blah)
    -- We explicitly use a case-sensitive (binary) collation to ensure case sensitivity for operators.
    WHEN (word COLLATE Latin1_General_BIN2) = ('AND' COLLATE Latin1_General_BIN2) THEN 'op_and' -- AND operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('OR' COLLATE Latin1_General_BIN2) THEN 'op_or' -- OR operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('NEAR' COLLATE Latin1_General_BIN2) THEN 'op_near' -- NEAR operator
    WHEN (word COLLATE Latin1_General_BIN2) = ('NOT' COLLATE Latin1_General_BIN2) THEN 'op_not' -- NOT operator
    WHEN word = '`AND`' THEN 'op_and`' -- AND operator (&)
    WHEN word = '`OR`' THEN 'op_or`' -- OR operator (|)
    WHEN word = '`NEAR`' THEN 'op_near`' -- NEAR operator (~)
    WHEN word = '`NOT`' THEN 'op_not`' -- NOT operator (-)
    WHEN CHARINDEX('*', word) > 0 THEN 'searchprefix' -- Wildcard operator (*)
    ELSE 'searchword' END AS operator
    FROM banana_split
    ),

    I wanted the search operators to be case sensitive in order not to change the behaviour of searches where people were including the words “and”, “or”, “not”, or “near”. This means that I needed to specify a case-sensitive collation for the string comparisons (obviously you can remove this if you don’t want case-sensitivity, or your database already uses a case-sensitive collation).

    Now another CTE tidies up by removing some superfluous characters, etc.:

    -------------------------------------------------------------------------------------------------
    -- NOW TIDY UP THE WORDS AND OPERATORS
    -------------------------------------------------------------------------------------------------
    tidy_up_round1 AS
    (
    -- tidy up some of the words
    SELECT rowid,
    REPLACE(
    CASE
    WHEN operator = 'searchexplicit' THEN REPLACE(word, '+', '') -- remove +'s FROM explicit searches
    WHEN operator = 'search_dq_o' THEN RIGHT(word, LEN(word)-1) -- remove the opening quote
    WHEN operator = 'search_dq_x' THEN LEFT(word, LEN(word)-1) -- remove the closing quote
    ELSE word END,
    '"', '') AS word, -- we can just kill all double-quotes now (any that are in the middle of words are not valid)
    operator
    FROM words_classified
    WHERE operator NOT IN ('op_and', 'op_and`') -- remove "AND" operators completely since these are the default inter-term operators anyway
    AND (PATINDEX('%[0-9a-z]%', word) > 0 OR operator LIKE 'paren%') -- only words which have 1 or more alphanumeric chars in them (unless it's a parenthesis)
    ),
    tidy_up_round2 AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
    word,
    operator
    FROM tidy_up_round1
    WHERE rowid BETWEEN -- exclude any operators that are incorrectly located at the beginning or end of the string.
    (SELECT MIN(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
    AND (SELECT MAX(rowid) FROM tidy_up_round1 WHERE REPLACE(operator, '`', '') NOT IN ('op_not', 'op_near', 'op_or'))
    ),

    Next, remove consecutive & duplicate operators…

    -------------------------------------------------------------------------------------------------
    -- REMOVE CONSECUTIVE OPERATORS (LIKE "OR NOT" OR "NEAR NEAR", WHICH AREN'T VALID).
    -------------------------------------------------------------------------------------------------
    de_duped AS
    (
    SELECT a.rowid,
    a.word,
    a.operator
    FROM tidy_up_round2 a
    LEFT JOIN tidy_up_round2 b
    ON a.rowid = (b.rowid + 1) -- consecutive rowid's
    AND a.operator LIKE 'op%' -- only operators
    AND b.operator LIKE 'op%' -- only operators
    WHERE b.rowid IS NULL
    ),

    …and exclude any empty or unnecessary parentheses…

    -------------------------------------------------------------------------------------------------
    -- GET PARENTHESIS GROUPS, AND THEN THE EMPTY/UNNECESSARY PAIRS (i.e. "()" or "(blah)")
    -------------------------------------------------------------------------------------------------
    parentheses_groups AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS prowid,
    rowid,
    word,
    operator
    FROM de_duped
    WHERE operator IN ('paren_o', 'paren_x')
    OR operator LIKE 'search%'
    ),
    empty_pairs AS
    (
    SELECT a.rowid AS rowid_o,
    b.rowid AS rowid_x
    FROM parentheses_groups a
    JOIN parentheses_groups b
    ON a.operator = 'paren_o' -- opening parenthesis
    AND b.operator = 'paren_x' -- closing parenthesis
    AND (
    a.prowid = (b.prowid + 1) -- where opening and closing are consecutive rows (i.e. empty)
    OR a.prowid = (b.prowid + 2) -- where opening and closing are 1 row apart (i.e. they surround a single search term - this is unnecessary)
    )
    ),
    -------------------------------------------------------------------------------------------------
    -- NOW EXCLUDE EMPTY PARENTHESES GROUPS FROM THE FINAL RESULT SET
    -------------------------------------------------------------------------------------------------
    no_empties AS
    (
    SELECT a.rowid,
    a.word,
    a.operator
    FROM de_duped a
    LEFT JOIN empty_pairs x
    ON (a.rowid = x.rowid_o OR a.rowid = x.rowid_x)
    WHERE x.rowid_o IS NULL
    ),

    And lastly, format any wildcard (i.e. prefix) search terms, and insert the results into a table variable:

    -------------------------------------------------------------------------------------------------
    -- WILDCARD HANDLER
    -------------------------------------------------------------------------------------------------
    wildcard_handler AS
    (
    SELECT ROW_NUMBER() OVER(ORDER BY rowid) AS rowid,
    CASE operator
    WHEN 'searchprefix' THEN
    '"' + CASE
    WHEN (LEFT(word, 1) <> '*') AND (RIGHT(word, 1) <> '*') THEN LEFT(word, CHARINDEX('*', word)-1)
    ELSE REPLACE(word, '*', '')
    END + '*"'
    ELSE word
    END AS word,
    operator
    FROM no_empties
    )
    -------------------------------------------------------------------------------------------------
    -- FINALLY, INSERT THE RESULTS INTO A TABLE VARIABLE
    -------------------------------------------------------------------------------------------------
    INSERT @tt_searchwords (rowid, word, operator)
    SELECT rowid, word, operator
    FROM wildcard_handler;

    Phew. Enough for this post I think.

    So, at this point if I pass in the following search string – sql NEAR server -(oracle | mysql) – then my table variable should be populated with the following:

    [Image: table variable contents]

    In the next (and final) post, I’ll wrap up the “near” operator & double-quote handlers, and the building of the final output string.

    See ya!

    DB Dave