
  • What’s in your toolbox?

    Toolbox Lock - by Jenn Durfey

    I’ve had a few weeks off work recently, during which time I decided to tidy up (or attempt to anyway) my collection of “utility scripts”.  This consists of about 95% regular T-SQL scripts, with the rest being a mixture of PowerShell scripts, batch files, etc.

    This made me wonder how other people manage their collections, and whether there are better ways (or tools) to catalogue and easily retrieve useful scripts and snippets?

    What I do

    My solution is to use the built-in SQL Server Management Studio Template Explorer.  I found a few good blog posts explaining where the templates are stored, and some different ways of getting the most out of them (for example, I went down the route of creating symbolic links so I could have one central set of scripts stored on my SkyDrive, which I have access to from any computer).

    I was thinking about taking this one step further; I’m the database team lead at work, and it might be useful to collaboratively compile a list of scripts that each of us find useful/indispensable, which we could then add to our source control repository.  We could then each have symbolic links to this repository on our own computers, meaning the whole team would automatically have access to the latest versions of our “über team toolbox” scripts.

    What’s in my toolbox?

    My SSMS Templates

    I’ve got a bunch of stuff… and I’ve still got a lot to categorise and move into my templates folder, but it’s a mix of some pretty well known utility scripts from some equally well known “SQL personalities”, as well as a lot of stuff I’ve written over the years and tucked aside for a rainy day, or scripts that I regularly need for the applications that we support.

    My “well known” utility scripts currently include:

    I have others that fall into this category (e.g. some queries from Glenn Berry) that I’m still tidying up and organising.  The SQLServerPedia site also has a list of useful scripts which I’ve bookmarked for later follow-up.

    What else is there?

    Well, if you don’t think using templates is going to work for you, then there are a few other options (although not as many as I would’ve thought – hence the reason for this post)…

    Snippets

    I think of snippets as mini, or “in-line”, templates – the reason being that most snippets are fragments of a larger script.  For example, you can have a snippet to insert the stored procedure header block that you always use, or the syntax for a cursor declaration.  My templates tend to be more full-featured scripts, but I’m sure these could be stored as snippets too.  Here’s an article on snippets, including how to create your own.

    Talking about snippets – it’s easy to confuse the built-in SSMS snippets (which is what I was talking about above) with those included in Mladen Prajdić’s very cool SSMS Tools Pack add-in.  In SSMS Tools Pack they’re more like fully customisable shortcuts to code snippets.  This means you don’t need to have an extra window open to select the snippet, you can just (for example) type ssf and hit enter, and it expands out to SELECT * FROM automagically. 😉

    SQL Scripts Manager

    This is a free tool from Red Gate, which to be honest I haven’t played with very much yet (it’s still on my to-do list).  It seems to offer some cool features (like tagging scripts, a list of authors, etc), and comes bundled with a range of very useful scripts to get you started.  Here’s a write-up on it.

    I’ll definitely take a closer look at it when I get a chance – I’m interested to see where it stores the scripts (i.e. whether I can plug it into a central source control repository), and also whether there is any SSMS integration (which would be way cooler than just an external app).

    Good ol’ Explorer

    I’m guessing this is still the most common method; just putting a bunch of snippets and scripts into a directory (or, if you’ve got some semblance of order, perhaps a categorised series of directories).

    If this works for you (as it did me for many years), and you can still easily find what you need, then I guess there’s no real reason to change. 😉

    What do you use?

    So, what method do you use to keep your scripts library in order, and do you have any other cool scripts you’d recommend I add to my library?  I’d love to hear from you via the comments section below.

    Cheers,
    DB Dave

  • SQL script of HTTP status codes

    HTTP Codes SQL Script

    Bit of a random post; I couldn’t find anything similar online, so created this script and thought I’d share it in case anyone randomly needs a lookup/dimension table of HTTP status codes. All pulled directly from Wikipedia. 🙂
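
    If you’d rather roll your own than download mine, the basic shape is just a small lookup table. Here’s a minimal sketch (the table and column names are mine, and I’ve only seeded a handful of codes – the full script covers the lot):

    -- Minimal sketch of an HTTP status code lookup/dimension table.
    -- Table/column names are illustrative; only a few codes are seeded here.
    IF OBJECT_ID('dbo.HttpStatusCode') IS NOT NULL DROP TABLE dbo.HttpStatusCode;
    CREATE TABLE dbo.HttpStatusCode (
        status_code   SMALLINT     NOT NULL PRIMARY KEY,
        reason_phrase VARCHAR(100) NOT NULL,
        code_class    VARCHAR(20)  NOT NULL -- e.g. 'Success', 'Client Error', etc.
    );

    INSERT dbo.HttpStatusCode (status_code, reason_phrase, code_class)
    VALUES (200, 'OK', 'Success'),
           (301, 'Moved Permanently', 'Redirection'),
           (404, 'Not Found', 'Client Error'),
           (500, 'Internal Server Error', 'Server Error');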

  • CURSORs and WHILE-Loops

    Yes, I know this is a giant (some might say pointless, and I wouldn’t totally disagree) can of worms I’m opening here, but stay with me for a minute.

    Why?

    We were discussing this at work a few months ago, and I was one of those people who followed “conventional wisdom” which says that SQL cursors are evil, and should be replaced with WHILE-loops if a set-based alternative isn’t possible (or practical).

    So I thought I’d do some testing for myself.  I know this has been done more than once before (I won’t mention any articles specifically, but they’re easy to find using your search engine of choice) – but I like to figure things out for myself rather than read a blog and take the author’s word as infallible.  Some of the other tests I found online also only concentrated on one specific metric when measuring performance (e.g. physical I/O), or were too obviously biased for my taste.

    Living in the Real World

    There are also what I refer to as “non-real-world” arguments against one or the other, for example:

    • Cursors are best because they’re easier/quicker to write (usually less variables and/or temp tables required).
    • While-loops are best because they’re inherently faster, since they don’t have to make use of any of the internal mechanisms that cursors use to allow backwards and forwards data-set traversal.

    The reason I call these “non-real-world” problems, is that I don’t think any DBA or developer worth their salt is going to worry too much about how many more characters one takes to type than the other, and I’ve very seldom (if ever) needed to use anything but a simple forward-only cursor – so these problems aren’t representative of the majority of use-cases in my experience.

    Obviously if you take something like this super-simplified while-loop below, it would be difficult to write a cursor that would perform faster – but why the hell would you?

    DECLARE @i INT = 0;
    WHILE @i <= 1000000 SET @i += 1; -- 405ms on my laptop

    What?

    Alright, down to business. Using the AdventureWorks database on my laptop (running 2008R2 build 10.50.2500.0), I wrote a simple test script which iterates through each SalesOrderHeader row (for online orders) and counts the number of SalesOrderDetail rows for each SalesOrderID.  I know… for someone who was just talking about “non-real-world” examples, this is pretty unrealistic – but the point is that it generates a measurable amount of I/O and CPU load, and I can reproduce the logic easily using both a while-loop and a cursor.

    Here’s the code for the while-loop test:

    ------------------------------------------------
    -- WHILE LOOP (using counter)
    ------------------------------------------------
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
    GO
    SET NOCOUNT ON;
    GO
    IF OBJECT_ID('tempdb..#SOID') IS NOT NULL DROP TABLE #SOID;
    CREATE TABLE #SOID (ID INT IDENTITY, SalesOrderID INT NOT NULL);
    DECLARE @i INT = 1, @ii INT, @SalesOrderID INT, @LineCount INT;
    
    INSERT #SOID (SalesOrderID)
    SELECT SalesOrderID
    FROM Sales.SalesOrderHeader
    WHERE OnlineOrderFlag = 1;
    
    SET @ii = @@ROWCOUNT;
    
    WHILE @i <= @ii
    BEGIN
        SET @SalesOrderID = (SELECT SalesOrderID FROM #SOID WHERE ID = @i);
    
        SET @LineCount = ISNULL(@LineCount, 0) + ISNULL((
            SELECT COUNT(*)
            FROM Sales.SalesOrderDetail
            WHERE SalesOrderID = @SalesOrderID
            ), 0);
    
        SET @i += 1;
    END
    
    IF @LineCount <> 60398
    BEGIN
        RAISERROR('Bad Total',16,1);
        PRINT @LineCount;
    END
    GO

    Even for a simple while-loop there are multiple ways to write the logic – I’ve chosen a simple counter here, but didn’t see massive variances in the performance of the 2 or 3 methods I tested.
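
    For the sake of illustration, one common alternative (not necessarily one I’d recommend over the counter approach) is to walk the key values directly rather than numbering them into a temp table – something along these lines:

    -- Illustrative alternative: walk the SalesOrderID values directly,
    -- no temp table or IDENTITY counter required
    DECLARE @SalesOrderID INT = 0, @LineCount INT = 0, @c INT;
    
    WHILE 1 = 1
    BEGIN
        SELECT @SalesOrderID = MIN(SalesOrderID)
        FROM Sales.SalesOrderHeader
        WHERE OnlineOrderFlag = 1
        AND SalesOrderID > @SalesOrderID;
    
        IF @SalesOrderID IS NULL BREAK; -- no more orders to process
    
        SELECT @c = COUNT(*)
        FROM Sales.SalesOrderDetail
        WHERE SalesOrderID = @SalesOrderID;
    
        SET @LineCount = @LineCount + @c;
    END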

    Here’s the code I used to test cursors:

    ------------------------------------------------
    -- CURSOR
    ------------------------------------------------
    --DECLARE cursor_name CURSOR [ LOCAL | GLOBAL ]
    --     [ FORWARD_ONLY | SCROLL ]
    --     [ STATIC | KEYSET | DYNAMIC | FAST_FORWARD ]
    --     [ READ_ONLY | SCROLL_LOCKS | OPTIMISTIC ]
    --     [ TYPE_WARNING ]
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
    GO
    SET NOCOUNT ON;
    GO
    DECLARE @SalesOrderID INT, @LineCount INT;
    DECLARE SOID CURSOR TYPE_WARNING
    FOR
    SELECT SalesOrderID
    FROM Sales.SalesOrderHeader
    WHERE OnlineOrderFlag = 1;
    
    OPEN SOID;
    FETCH NEXT FROM SOID INTO @SalesOrderID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @LineCount = ISNULL(@LineCount,0) + ISNULL((
            SELECT COUNT(*)
            FROM Sales.SalesOrderDetail
            WHERE SalesOrderID = @SalesOrderID
            ), 0);
    
        FETCH NEXT FROM SOID INTO @SalesOrderID;
    END
    CLOSE SOID;
    DEALLOCATE SOID;
    
    IF @LineCount <> 60398
    BEGIN
        RAISERROR('Bad Total',16,1);
        PRINT @LineCount;
    END
    GO

    This represents the simplest test (i.e. just a plain cursor declaration with no options other than TYPE_WARNING).  I used this as a baseline, to which I added various combinations of cursor options, such as STATIC vs DYNAMIC, LOCAL vs GLOBAL, etc.

    To measure performance I profiled the execution of the main batch (i.e. not including the checkpoint, clearing the caches, etc) and recorded CPU, Duration, Reads, & Writes over 5 executions per test run (I actually ran each more than just 5 times, but results were so consistent that I only recorded the last 5 executions of each test).
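
    If you want to reproduce this without firing up Profiler, you should get roughly the same numbers by switching on the session statistics before each batch – CPU and elapsed time come from STATISTICS TIME, and reads/writes from STATISTICS IO:

    -- Profiler-free alternative: reports CPU/elapsed time and I/O
    -- for each statement to the Messages tab
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;
    GO
    -- ...run the while-loop or cursor batch here...
    GO
    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;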

    Show me the money

    Cursor Duration

    The surprising thing for me (in this test case) was how poorly the while-loop performed.  The graph to the right shows the various cursor-based test runs, all of which completed in less than 1 second (click the image to view a larger copy).

    The while-loop consistently took 30 seconds to run.

    By the way, please feel free to point out any mistakes I’ve made.

    Average Reads

    The reads show a similar story regarding the stark difference between the while-loop and any of the cursor tests.  The cursor read I/O figures are in the graph to the right, and the while-loop averaged well over 1.8 million reads (it dwarfed the others on the graph so I removed it).

    I guess what I find interesting is that many people have run tests, and look at logical or physical I/O, but don’t take into account CPU or overall duration.  If my data is sitting in memory anyway, I don’t really care about those logical reads – but the difference between a query taking 100 vs 1000 ms of my CPU’s time is very important to me.  Am I neglecting to take anything else (important) into account?

    Conclusion

    I don’t have one…

    These tests are too fickle to make any concrete observations – you could change just a few lines of code and get completely different results to what I’m seeing.

    Personally, I will continue to use both where appropriate.  Before anyone has a heart attack let me clarify that I would always first use a set-based solution before resorting to either.

    But… my rule of thumb is that I use CURSORS if I’m performing data manipulation, and a WHILE loop if I’m performing an action until some condition is met.  For cursors, I tend to use the “LOCAL STATIC FORWARD_ONLY READ_ONLY” options, just because they roll off the tongue so nicely. If you’re on the SQL Server team at Microsoft and you’re reading this – in the next version of SQL could you please consolidate these options down to a single “GO_BABY_GO” option instead? 😉

    As a final note, highlighting why a set-based alternative is always better: running this query “properly” (i.e. set-based, rather than row-by-row) gave me just over 900 reads, and completed in 29 milliseconds. No contest really.
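
    For reference, the set-based version boils down to a single join and aggregate – something like this (which should return the same 60398 total as the tests above):

    -- Set-based equivalent: one join + aggregate instead of row-by-row counting
    SELECT COUNT(*) AS LineCount
    FROM Sales.SalesOrderHeader AS soh
    JOIN Sales.SalesOrderDetail AS sod ON sod.SalesOrderID = soh.SalesOrderID
    WHERE soh.OnlineOrderFlag = 1;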

    Cheers,

    Dave

  • Preventing identity overflow

    Here’s a stored procedure I put together a week or so ago that I thought others might find useful.

    Facepalm

    We had an issue which should really never be allowed to happen (but as they say, “shit happens”); a rather important table ran out of integer values for the identity column.  Luckily the actual value of the ID column on this table isn’t important because the table only stores the most recent month’s data, and everything before that is archived – so we could simply reseed the identity value back to 0 (which had long since been purged from this table).

    To prevent this in future I put together a pretty simple query to check all identity values, and check them against the maximum value for their given data types. This eventually became a procedure which is now just part of my “utility” database which I deploy to every production server.

    IF OBJECT_ID('dbo.usp_check_max_identity_value') IS NULL
        EXEC('CREATE PROC dbo.usp_check_max_identity_value AS ');
    GO
    ALTER PROC dbo.usp_check_max_identity_value (
        @alert_threshold TINYINT = 60, -- The percentage of the max value over which we will alert.
        @send_alert_as_email BIT = 0, -- Do you want to send any alerts as an email? (Well... do ya, punk?)
        @alert_email_recipients NVARCHAR(MAX) = NULL -- Semicolon-delimited list of email recipients.
    ) AS
    BEGIN
        SET NOCOUNT ON;
        SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    
        DECLARE @srv_name VARCHAR(200),
                @db_name VARCHAR(200),
                @table_name VARCHAR(200),
                @column_name VARCHAR(200),
                @data_type VARCHAR(200),
                @max_value BIGINT,
                @max_tinyint TINYINT,
                @max_smallint SMALLINT,
                @max_int INT,
                @max_bigint BIGINT,
                @cmd NVARCHAR(4000),
                @percent_of_max INT,
                @alert_subject NVARCHAR(255),
                @alert_body NVARCHAR(MAX),
                @alert_counter INT;
    
        DECLARE @tt_alert_data TABLE (
            row_id INT IDENTITY,
            [db_name] VARCHAR(200),
            table_name VARCHAR(200),
            column_name VARCHAR(200),
            data_type VARCHAR(200),
            percent_of_max INT
        );
    
        -- These are the max values of each integer data type
        SELECT @max_tinyint = 255,
               @max_smallint = 32767,
               @max_int = 2147483647,
               @max_bigint = 9223372036854775807;
    
        -- Get a list of all available user databases to loop through
        DECLARE db_cursor CURSOR LOCAL STATIC FORWARD_ONLY READ_ONLY FOR
        SELECT DISTINCT
               QUOTENAME(name)
        FROM sys.databases
        WHERE database_id > 4
        AND is_read_only = 0
        AND state_desc = 'ONLINE';
    
        OPEN db_cursor;
        FETCH NEXT FROM db_cursor INTO @db_name;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Get a list of all user tables in the current database
            SET @cmd = N'
                DECLARE tables_cursor CURSOR GLOBAL STATIC FORWARD_ONLY READ_ONLY FOR
                SELECT DISTINCT
                       QUOTENAME(SCHEMA_NAME(t.[schema_id])) + ''.'' + QUOTENAME(t.name) AS table_name,
                       QUOTENAME(c.name) AS column_name,
                       typ.name AS data_type
                FROM ' + @db_name + '.sys.tables t
                JOIN ' + @db_name + '.sys.columns c ON t.[object_id] = c.[object_id]
                JOIN ' + @db_name + '.sys.types typ ON c.system_type_id = typ.system_type_id
                WHERE t.[type] = ''U''
                AND t.is_ms_shipped = 0
                AND c.is_identity = 1
                AND typ.name IN (''tinyint'',''smallint'',''int'',''bigint'')
                ORDER BY QUOTENAME(SCHEMA_NAME(t.[schema_id])) + ''.'' + QUOTENAME(t.name);
            ';
    
            EXEC sp_executesql @cmd, N'';
    
            -- Loop through the list of tables
            OPEN tables_cursor;
            FETCH NEXT FROM tables_cursor INTO @table_name, @column_name, @data_type;
            WHILE @@FETCH_STATUS = 0
            BEGIN
                SELECT @max_value = NULL, @percent_of_max = NULL;
                SET @cmd = N'SELECT @max = MAX(' + @column_name + ') FROM ' + @db_name + '.' + @table_name + ' WITH (NOLOCK);';
    
                -- Run the SELECT to fetch the maximum value from the IDENTITY column
                EXEC sp_executesql @cmd,
                    N'@max BIGINT OUTPUT',
                    @max = @max_value OUTPUT;
    
                IF @max_value IS NULL GOTO FetchNext;
    
                -- Work out the percentage this represents of the maximum possible value
                SELECT @percent_of_max = CAST(CASE
                    WHEN @data_type = 'tinyint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_tinyint AS FLOAT)) * 100.0)
                    WHEN @data_type = 'smallint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_smallint AS FLOAT)) * 100.0)
                    WHEN @data_type = 'int' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_int AS FLOAT)) * 100.0)
                    WHEN @data_type = 'bigint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_bigint AS FLOAT)) * 100.0)
                    END AS INT);
    
                -- If the percentage is greater than our threshold, raise an alert
                IF (@percent_of_max > @alert_threshold)
                BEGIN
                    -- Insert the values for use later (if we're going to send an email)
                    IF (@send_alert_as_email = 1)
                    BEGIN
                        INSERT @tt_alert_data ([db_name], table_name, column_name, data_type, percent_of_max)
                        VALUES (@db_name, @table_name, @column_name, @data_type, @percent_of_max);
                    END
    
                    RAISERROR(
                        '%s - Identity column %s on table %s is currently at %i%% of an %s''s maximum value.',
                        10,
                        1,
                        @db_name,
                        @column_name,
                        @table_name,
                        @percent_of_max,
                        @data_type
                    ) WITH LOG;
    
                    SET @alert_counter = ISNULL(@alert_counter,0) + 1;
                END
    
                FetchNext:
                FETCH NEXT FROM tables_cursor INTO @table_name, @column_name, @data_type;
            END
    
            CLOSE tables_cursor;
            DEALLOCATE tables_cursor;
    
            FETCH NEXT FROM db_cursor INTO @db_name;
        END
    
        CLOSE db_cursor;
        DEALLOCATE db_cursor;
    
        -- If we want to send an alert email (and there is one to send), then let's do that...
        IF (@send_alert_as_email = 1) AND (ISNULL(@alert_counter, 0) > 0)
        BEGIN
            -- If we haven't provided an email recipient, just carry on regardless
            IF @alert_email_recipients IS NULL RETURN;
    
            SET @srv_name = QUOTENAME(CAST(@@SERVERNAME AS VARCHAR(200)));
            SET @alert_subject = @srv_name + ' - Identity value overflow warning - ' + CAST(@alert_counter AS VARCHAR) + ' columns found.';
    
            -- Build the email body (HTML tags for the benefit of making the email pretty, but not necessary)
            SET @alert_body = '<html><body><style type="text/css"> h3, p, table {font-family:verdana;}</style>';
            SET @alert_body = @alert_body +
                N'<H3>Identity value overflow warning</H3>' +
                N'<p>This alert means that the following tables on the server ' + @srv_name
                + ' have integer-based IDENTITY column values which are currently greater than <strong>'
                + CAST(@alert_threshold AS VARCHAR) + '%</strong> of their respective data-types'' maximum possible values.</p>' +
                N'<p>This is just a warning, but you may want to consider increasing the size of the data-type you use (if possible) '
                + 'or reseeding the table (if the IDENTITY value itself is not important, other than it being unique).</p>' +
                N'<table border="1" cellspacing="0" cellpadding="3">' +
                N'<tr style="background-color:#c9cfff;"><th>Database</th><th>Table</th><th>Column</th><th>Type</th><th>%</th></tr>' +
                CAST(( SELECT td = [db_name], '',
                              td = [table_name], '',
                              td = [column_name], '',
                              td = [data_type], '',
                              td = [percent_of_max]
                       FROM @tt_alert_data
                       ORDER BY row_id
                       FOR XML PATH('tr'), TYPE
                     ) AS NVARCHAR(MAX)) +
                N'</table></body></html>';
    
            -- Send the email (NB: This assumes that you have a default Database Mail profile configured)
            EXEC msdb.dbo.sp_send_dbmail
                @recipients = @alert_email_recipients,
                @subject = @alert_subject,
                @body = @alert_body,
                @body_format = 'HTML',
                @importance = 'High';
        END
    END
    GO
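
    For example, once it’s deployed to your utility database, a scheduled job could call it along these lines (the threshold and recipient address below are just placeholders – use whatever suits you):

    -- Example call: warn at 80% of each data type's maximum, and email the results
    -- (assumes a default Database Mail profile is configured)
    EXEC dbo.usp_check_max_identity_value
        @alert_threshold = 80,
        @send_alert_as_email = 1,
        @alert_email_recipients = 'dba-team@example.com';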

    It’s not very complex – it iterates through all user databases on the server, and then over all user tables that have an identity column, and simply selects the maximum values from each of those identity columns.  This max value is then compared to the max value for the respective data type of the column (i.e. tinyint, smallint, int, or bigint).

    What it does after that is up to you. In this case I send an email with a nicely formatted HTML email displaying the offending columns.  Obviously you can customise what action to take, and what your threshold is. And yes, the title of this post is a little misleading since this doesn’t actually prevent anything, but rather alerts you to impending doom.

    Warning: Before running this on a production system, make sure that you’re happy for it to be selecting from all of your (potentially large and/or busy) tables that have identity columns.  If you have parallel test/dev systems, then run them there instead to prevent any negative performance impact in production. I also take no responsibility for bugs or unexpected behaviour. Consult your GP before taking.

    Cheers

    DB Dave

  • SQL Server word-level bigram function

    I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full text searches that people perform).  Over the years I’ve been involved in various aspects of our search; from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators for example) to working on our recommendation algorithm, and implementing search auto-suggest.

    Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.

    Did you mean…?

    The way we currently do it is based on taking the individual words entered, and generating a large number of alternative spellings for each of them, and then comparing those lists of words against a “known-good” list of search terms and returning the most likely one for each word entered (according to a magic score based on frequency, average results per search, etc).  This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word level n-grams (unigrams, bigrams, and trigrams specifically).

    Basically, this involves taking each search string and splitting it into individual words to give you your unigrams, and then pairing them up (in the case of bigrams).

    For example, the string “pink ipod nano” would be returned as:

    Word 1    Word 2
    ------    ------
    #         pink
    pink      ipod
    ipod      nano
    nano      #

    The hash symbols are delimiters and are sometimes not used, in which case the 1st and last rows of this table wouldn’t be returned. In my function below, I use a less common symbol (the “Eszett”, ß) as my delimiter, and have a parameter to allow it to be omitted if necessary.

    What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type in the case of a misspelling.

    For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters).  Or, if the same user types in “ipod nqno“, then we should be able to work out that firstly “nqno” isn’t a common word according to our list of unigrams, and secondly that the word “ipod” followed by a space and “n” is most likely meant to be “ipod nano” – and we can then recommend that as an alternative search.
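
    To make that concrete: if the bigrams and their frequencies were stored in a table (the schema here is hypothetical – this isn’t our production setup), the “what usually follows ipod” lookup is a trivial query:

    -- Hypothetical bigram frequency table: predict the next word after 'ipod'
    SELECT TOP (5)
           word2,
           frequency
    FROM dbo.search_bigrams
    WHERE word1 = N'ipod'
    AND word2 LIKE N'n%' -- narrow down further as the user types an 'n'
    ORDER BY frequency DESC;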

    Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂

    The guts

    It’s relatively self explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL function allows me to test my theory myself before asking a developer to spend any time on it.

    -- =============================================
    -- Author:		David Curlewis
    -- Create date: 2011-09-08
    -- Description:	Splits a string and returns word-level bigrams (i.e. all pairs of words making up the string)
    -- =============================================
    CREATE FUNCTION dbo.fn_word_bigrams (
        @string             NVARCHAR(1000),
        @include_delimiters BIT = 1 -- whether to include rows for prefix and suffix delimiters
    )
    RETURNS @bigrams TABLE (
        row_id  INT, 
        word1   NVARCHAR(100),
        word2   NVARCHAR(100) 
    )
    AS
    BEGIN
        SET @string = LTRIM(RTRIM(@string))
            + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;
    
        WITH split_words AS
        (
            SELECT  ROW_NUMBER() OVER(ORDER BY @@SPID) AS row_id,
                    item AS word
            FROM    dbo.SplitString_Multi(@string, ' ')
        )
        , bigrams AS 
        (
            SELECT  row_id,
                    CAST('ß' AS NVARCHAR(4000)) AS word1, -- ß is our delimiter
                    CAST(word AS NVARCHAR(4000)) AS word2
            FROM    split_words
            WHERE   row_id = 1
            UNION ALL
            SELECT  b.row_id + 1 AS row_id,
                    CAST(b.word2 AS NVARCHAR(4000)) AS word1, 
                    CAST(w.word AS NVARCHAR(4000)) AS word2
            FROM    split_words AS w
            JOIN    bigrams AS b ON w.row_id = b.row_id + 1
        )
        INSERT  @bigrams
        SELECT  row_id,
                word1,
                word2
        FROM    bigrams
        WHERE   row_id > CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;
        
        RETURN; 
    END
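
    Assuming you have a string-splitting function called dbo.SplitString_Multi (mine isn’t included here), calling it looks like this – and should return the same four pairs as the table above, just with ß as the delimiter instead of #:

    -- Example call, with the delimiter rows included
    SELECT row_id, word1, word2
    FROM dbo.fn_word_bigrams(N'pink ipod nano', 1);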

    Shout if you have any questions or comments.

    Cheers,

    DB Dave

  • Full-Text Search Operators – Part 4: Wrapping it up

    "iPod nano 4Gb wrapping" by Darkmere
    That’s a wrap!

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string

    Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two.  I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

    Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T-SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

    NEAR enough…

    “NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINS or CONTAINSTABLE.

    In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them.  Where NEAR differs from AND is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
    So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.
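
    For example (the table and column names here are made up), something along these lines only returns documents where the terms actually scored as being “near” each other:

    -- Hypothetical example: filter out rows where NEAR scored 0
    -- (i.e. the words are more than ~50 characters apart)
    SELECT d.doc_id, d.title, ft.[RANK]
    FROM dbo.Documents AS d
    JOIN CONTAINSTABLE(dbo.Documents, doc_body, 'dog NEAR cat') AS ft ON ft.[KEY] = d.doc_id
    WHERE ft.[RANK] > 0
    ORDER BY ft.[RANK] DESC;

    Anyway, here’s the part of my procedure that handles the NEAR operator: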

    -------------------------------------------------------------------------------------------------
    -- "NEAR" OPERATOR HANDLER
    -------------------------------------------------------------------------------------------------
    IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
    BEGIN
        -- reinitialise
        SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;
    
        -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
        DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
        SELECT rowid
        FROM @tt_searchwords
        WHERE REPLACE(operator, '`', '') = 'op_near';
    
        OPEN near_string;
        FETCH NEXT FROM near_string INTO @i;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- get the rowid of the value before and after the NEAR operator row
            SELECT @o = MAX(rowid)
            FROM @tt_searchwords
            WHERE rowid < @i
            AND operator NOT LIKE 'paren%';
    
            SELECT @oo = MIN(rowid)
            FROM @tt_searchwords
            WHERE rowid > @i
            AND operator NOT LIKE 'paren%';
    
            -- if the NEAR operator (~) is the first or last word in the search string, it's not a valid option
            IF (@o IS NULL) OR (@oo IS NULL)
                DELETE @tt_searchwords
                WHERE rowid = @i;
            ELSE BEGIN
                -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
                SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
                FROM @tt_searchwords
                WHERE rowid = @o;
    
                SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
                FROM @tt_searchwords
                WHERE rowid = @oo;
    
                -- update the NEAR operator record with the concatenated search term
                UPDATE @tt_searchwords
                SET word = @current_word,
                    operator = 'searchnear'
                WHERE rowid = @i;
    
                -- flag the records before and after the NEAR operator for deletion below
                UPDATE @tt_searchwords
                SET operator = 'searchword_near'
                WHERE rowid IN (@o, @oo);
            END
    
            FETCH NEXT FROM near_string INTO @i;
        END
    
        CLOSE near_string;
        DEALLOCATE near_string;
    
        -- delete search words that are now included in "NEAR phrases"
        DELETE @tt_searchwords
        WHERE operator = 'searchword_near';
    END

    It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

    Builderer of awesome

    This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together.  It’s not pretty, but it works.

    -------------------------------------------------------------------------------------------------
    -- Final command string builderer of awesome
    -------------------------------------------------------------------------------------------------
    -- Need to quote any search terms that contain certain non-alphanumeric characters
    UPDATE @tt_searchwords
    SET word = CASE
            WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
            WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
            WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
            ELSE word END
    WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted
    
    -- reinitialise
    SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';
    
    -- get the range of row IDs that represent valid search terms and operators
    SELECT @i = MIN(rowid),
           @ii = MAX(rowid)
    FROM @tt_searchwords
    WHERE operator LIKE 'search%'
    OR operator LIKE 'paren%';
    
    WHILE @i <= @ii
    BEGIN
        SELECT @final_search_temp =
            CASE REPLACE(operator, '`', '')
                WHEN 'op_not' THEN ' AND NOT '
                WHEN 'op_or' THEN ' OR '
                WHEN 'paren_o' THEN ' AND ('
                WHEN 'paren_x' THEN ')'
                ELSE ' AND '
            END
            + CASE REPLACE(operator, '`', '')
                WHEN 'op_not' THEN '' -- NOT
                WHEN 'op_or' THEN '' -- OR
                WHEN 'paren_o' THEN '' -- (
                WHEN 'paren_x' THEN '' -- )
                WHEN 'searchnear' THEN word -- NEAR
                WHEN 'searchphrase' THEN word -- "search phrase"
                WHEN 'searchexplicit' THEN word -- explicit search
                WHEN 'searchprefix' THEN word -- wildcard
                ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word) -- AND (default)
            END
        FROM @tt_searchwords
        WHERE rowid = @i;
    
        -- CONTAINSTABLE only accepts up to 4000 characters
        IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
            BREAK
        ELSE
            SET @final_search = @final_search + @final_search_temp;
    
        IF @i = @ii BREAK;
    
        SELECT @i = MIN(rowid)
        FROM @tt_searchwords
        WHERE rowid > @i;
    END
    
    -- sort out issues of operator double-ups, etc
    -- clunky but it works
    SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
    SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
    SET @final_search = '(' + @final_search + ')';
    SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
    SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
    SET @final_search = REPLACE(@final_search, '( AND ', '(');
    SET @final_search = REPLACE(@final_search, ' AND )', ')');
    SET @final_search = REPLACE(@final_search, '( OR ', '(');
    SET @final_search = REPLACE(@final_search, ' OR )', ')');
    SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);

    That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

    Cheers
    DB Dave