
  • What’s in your toolbox?

    Toolbox Lock - by Jenn Durfey

    I’ve had a few weeks off work recently, during which time I decided to tidy up (or attempt to anyway) my collection of “utility scripts”.  This consists of about 95% regular T-SQL scripts, with the rest being a mixture of PowerShell scripts, batch files, etc.

    This made me wonder how other people manage their collections, and whether there are better ways (or tools) to catalogue and easily retrieve useful scripts and snippets?

    What I do

    My solution is to use the built-in SQL Server Management Studio Template Explorer.  I found a few good blog posts explaining where the templates are stored, and some different ways of getting the most out of them (for example, I went down the route of creating symbolic links so I could have one central set of scripts stored on my SkyDrive, which I have access to from any computer).

    I was thinking about taking this one step further; I’m the database team lead at work, and it might be useful to collaboratively compile a list of scripts that each of us find useful/indispensable, which we could then add to our source control repository.  We could then each have symbolic links to this repository on our own computers, meaning the whole team would automatically have access to the latest versions of our “über team toolbox” scripts.

    What’s in my toolbox?

    My SSMS Templates

    I’ve got a bunch of stuff… and I’ve still got a lot to categorise and move into my templates folder, but it’s a mix of some pretty well known utility scripts from some equally well known “SQL personalities”, as well as a lot of stuff I’ve written over the years and tucked aside for a rainy day, or scripts that I regularly need for the applications that we support.

    My “well known” utility scripts currently include:

    I have others that fall into this category (e.g. some queries from Glenn Berry) that I’m still tidying up and organising.  The SQLServerPedia site also has a list of useful scripts which I’ve bookmarked for later follow-up.

    What else is there?

    Well, if you don’t think using templates is going to work for you, then there are a few other options (although not as many as I would’ve thought – hence the reason for this post)…

    Snippets

    I think of snippets as mini, or “in-line”, templates – the reason being that most snippets are fragments of a larger script.  For example, you can have a snippet to insert the stored procedure header block that you always use, or the syntax for a cursor declaration.  My templates tend to be more full-featured scripts, but I’m sure these could be stored as snippets too.  Here’s an article on snippets, including how to create your own.

    Talking about snippets – it’s easy to confuse the built-in SSMS snippets (which is what I was talking about above) with those included in Mladen Prajdić’s very cool SSMS Tools Pack add-in.  In SSMS Tools Pack they’re more like fully customisable shortcuts to code snippets.  This means you don’t need to have an extra window open to select the snippet, you can just (for example) type ssf and hit enter, and it expands out to SELECT * FROM automagically. 😉

    SQL Scripts Manager

    This is a free tool from Red Gate, which to be honest I haven’t played with very much yet (it’s still on my to-do list).  It seems to offer some cool features (like tagging scripts, a list of authors, etc), and comes bundled with a range of very useful scripts to get you started.  Here’s a write-up on it.

    I’ll definitely take a closer look at it when I get a chance – I’m interested to see where it stores the scripts (i.e. whether I can plug it into a central source control repository), and also whether there is any SSMS integration (which would be way cooler than just an external app).

    Good ol’ Explorer

    I’m guessing this is still the most common method; just putting a bunch of snippets and scripts into a directory (or, if you’ve got some semblance of order, perhaps a categorised series of directories).

    If this works for you (as it did me for many years), and you can still easily find what you need, then I guess there’s no real reason to change. 😉

    What do you use?

    So, what method do you use to keep your scripts library in order, and do you have any other cool scripts you’d recommend I add to my library?  I’d love to hear from you via the comments section below.

    Cheers,
    DB Dave

  • SQL script of HTTP status codes

    HTTP Codes SQL Script

    Bit of a random post; I couldn’t find anything similar online, so created this script and thought I’d share it in case anyone randomly needs a lookup/dimension table of HTTP status codes. All pulled directly from Wikipedia. 🙂
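
    If you’d rather roll your own than download mine, the basic shape is just a small lookup table. Here’s a minimal sketch (the table and column names are mine, and I’ve only seeded a handful of codes – the full script covers the lot):

    -- Minimal sketch of an HTTP status code lookup/dimension table.
    -- Table/column names are illustrative; only a few codes are seeded here.
    IF OBJECT_ID('dbo.HttpStatusCode') IS NOT NULL DROP TABLE dbo.HttpStatusCode;
    CREATE TABLE dbo.HttpStatusCode (
        status_code   SMALLINT     NOT NULL PRIMARY KEY,
        reason_phrase VARCHAR(100) NOT NULL,
        code_class    VARCHAR(20)  NOT NULL -- e.g. 'Success', 'Client Error', etc.
    );

    INSERT dbo.HttpStatusCode (status_code, reason_phrase, code_class)
    VALUES (200, 'OK', 'Success'),
           (301, 'Moved Permanently', 'Redirection'),
           (404, 'Not Found', 'Client Error'),
           (500, 'Internal Server Error', 'Server Error');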

  • CURSORs and WHILE-Loops

    Yes, I know this is a giant (some might say pointless, and I wouldn’t totally disagree) can of worms I’m opening here, but stay with me for a minute.

    Why?

    We were discussing this at work a few months ago, and I was one of those people who followed “conventional wisdom” which says that SQL cursors are evil, and should be replaced with WHILE-loops if a set-based alternative isn’t possible (or practical).

    So I thought I’d do some testing for myself.  I know this has been done more than once before (I won’t mention any articles specifically, but they’re easy to find using your search engine of choice) – but I like to figure things out for myself rather than read a blog and take the author’s word as infallible.  Some of the other tests I found online also only concentrated on one specific metric when measuring performance (e.g. physical I/O), or were too obviously biased for my taste.

    Living in the Real World

    There are also what I refer to as “non-real-world” arguments against one or the other, for example:

    • Cursors are best because they’re easier/quicker to write (usually less variables and/or temp tables required).
    • While-loops are best because they’re inherently faster, since they don’t have to make use of any of the internal mechanisms that cursors use to allow backwards and forwards data-set traversal.

    The reason I call these “non-real-world” problems, is that I don’t think any DBA or developer worth their salt is going to worry too much about how many more characters one takes to type than the other, and I’ve very seldom (if ever) needed to use anything but a simple forward-only cursor – so these problems aren’t representative of the majority of use-cases in my experience.

    Obviously if you take something like this super-simplified while-loop below, it would be difficult to write a cursor that would perform faster – but why the hell would you?

    DECLARE @i INT = 0;
    WHILE @i <= 1000000 SET @i += 1; -- 405ms on my laptop

    What?

    Alright, down to business. Using the AdventureWorks database on my laptop (running 2008R2 build 10.50.2500.0), I wrote a simple test script which iterates through each SalesOrderHeader row (for online orders) and counts the number of SalesOrderDetail rows for each SalesOrderID.  I know… for someone who was just talking about “non-real-world” examples, this is pretty unrealistic – but the point is that it generates a measurable amount of I/O and CPU load, and I can reproduce the logic easily using both a while-loop and a cursor.

    Here’s the code for the while-loop test:

    ------------------------------------------------
    -- WHILE LOOP (using counter)
    ------------------------------------------------
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
    GO
    SET NOCOUNT ON;
    GO
    IF OBJECT_ID('tempdb..#SOID') IS NOT NULL DROP TABLE #SOID;
    CREATE TABLE #SOID (ID INT IDENTITY, SalesOrderID INT NOT NULL);
    DECLARE @i INT = 1, @ii INT, @SalesOrderID INT, @LineCount INT;
    
    INSERT #SOID (SalesOrderID)
    SELECT SalesOrderID
    FROM Sales.SalesOrderHeader
    WHERE OnlineOrderFlag = 1;
    
    SET @ii = @@ROWCOUNT;
    
    WHILE @i <= @ii
    BEGIN
        SET @SalesOrderID = (SELECT SalesOrderID FROM #SOID WHERE ID = @i);
    
        SET @LineCount = ISNULL(@LineCount, 0) + ISNULL((
            SELECT COUNT(*)
            FROM Sales.SalesOrderDetail
            WHERE SalesOrderID = @SalesOrderID
            ), 0);
    
        SET @i += 1;
    END
    
    IF @LineCount <> 60398
    BEGIN
        RAISERROR('Bad Total',16,1);
        PRINT @LineCount;
    END
    GO

    Even for a simple while-loop there are multiple ways to write the logic – I’ve chosen a simple counter here, but didn’t see massive variances in the performance of the 2 or 3 methods I tested.
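
    For the sake of illustration, one common alternative (not necessarily one I’d recommend over the counter approach) is to walk the key values directly rather than numbering them into a temp table – something along these lines:

    -- Illustrative alternative: walk the SalesOrderID values directly,
    -- no temp table or IDENTITY counter required
    DECLARE @SalesOrderID INT = 0, @LineCount INT = 0, @c INT;
    
    WHILE 1 = 1
    BEGIN
        SELECT @SalesOrderID = MIN(SalesOrderID)
        FROM Sales.SalesOrderHeader
        WHERE OnlineOrderFlag = 1
        AND SalesOrderID > @SalesOrderID;
    
        IF @SalesOrderID IS NULL BREAK; -- no more orders to process
    
        SELECT @c = COUNT(*)
        FROM Sales.SalesOrderDetail
        WHERE SalesOrderID = @SalesOrderID;
    
        SET @LineCount = @LineCount + @c;
    END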

    Here’s the code I used to test cursors:

    ------------------------------------------------
    -- CURSOR
    ------------------------------------------------
    --DECLARE cursor_name CURSOR [ LOCAL | GLOBAL ]
    --     [ FORWARD_ONLY | SCROLL ]
    --     [ STATIC | KEYSET | DYNAMIC | FAST_FORWARD ]
    --     [ READ_ONLY | SCROLL_LOCKS | OPTIMISTIC ]
    --     [ TYPE_WARNING ]
    CHECKPOINT;
    DBCC DROPCLEANBUFFERS;
    GO
    SET NOCOUNT ON;
    GO
    DECLARE @SalesOrderID INT, @LineCount INT;
    DECLARE SOID CURSOR TYPE_WARNING
    FOR
    SELECT SalesOrderID
    FROM Sales.SalesOrderHeader
    WHERE OnlineOrderFlag = 1;
    
    OPEN SOID;
    FETCH NEXT FROM SOID INTO @SalesOrderID;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        SET @LineCount = ISNULL(@LineCount,0) + ISNULL((
            SELECT COUNT(*)
            FROM Sales.SalesOrderDetail
            WHERE SalesOrderID = @SalesOrderID
            ), 0);
    
        FETCH NEXT FROM SOID INTO @SalesOrderID;
    END
    CLOSE SOID;
    DEALLOCATE SOID;
    
    IF @LineCount <> 60398
    BEGIN
        RAISERROR('Bad Total',16,1);
        PRINT @LineCount;
    END
    GO

    This represents the simplest test (i.e. just a plain cursor declaration with no options other than TYPE_WARNING).  I used this as a baseline, to which I added various combinations of cursor options, such as STATIC vs DYNAMIC, LOCAL vs GLOBAL, etc.

    To measure performance I profiled the execution of the main batch (i.e. not including the checkpoint, clearing the caches, etc) and recorded CPU, Duration, Reads, & Writes over 5 executions per test run (I actually ran each more than just 5 times, but results were so consistent that I only recorded the last 5 executions of each test).
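
    If you want to reproduce this without firing up Profiler, you should get roughly the same numbers by switching on the session statistics before each batch – CPU and elapsed time come from STATISTICS TIME, and reads/writes from STATISTICS IO:

    -- Profiler-free alternative: reports CPU/elapsed time and I/O
    -- for each statement to the Messages tab
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;
    GO
    -- ...run the while-loop or cursor batch here...
    GO
    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;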

    Show me the money

    Cursor Duration

    The surprising thing for me (in this test case) was how poorly the while-loop performed.  The graph to the right shows the various cursor-based test runs, all of which completed in less than 1 second (click the image to view a larger copy).

    The while-loop consistently took 30 seconds to run.

    By the way, please feel free to point out any mistakes I’ve made.

    Average Reads

    The reads show a similar story regarding the stark difference between the while-loop and any of the cursor tests.  The cursor read I/O figures are in the graph to the right, and the while-loop averaged well over 1.8 million reads (it dwarfed the others on the graph so I removed it).

    I guess what I find interesting is that many people have run tests, and look at logical or physical I/O, but don’t take into account CPU or overall duration.  If my data is sitting in memory anyway, I don’t really care about those logical reads – but the difference between a query taking 100 vs 1000 ms of my CPU’s time is very important to me.  Am I neglecting to take anything else (important) into account?

    Conclusion

    I don’t have one…

    These tests are too fickle to make any concrete observations – you could change just a few lines of code and get completely different results to what I’m seeing.

    Personally, I will continue to use both where appropriate.  Before anyone has a heart attack let me clarify that I would always first use a set-based solution before resorting to either.

    But… my rule of thumb is that I use CURSORS if I’m performing data manipulation, and a WHILE loop if I’m performing an action until some condition is met.  For cursors, I tend to use the “LOCAL STATIC FORWARD_ONLY READ_ONLY” options, just because they roll off the tongue so nicely. If you’re on the SQL Server team at Microsoft and you’re reading this – in the next version of SQL could you please consolidate these options down to a single “GO_BABY_GO” option instead? 😉

    As a final note, highlighting why a set-based alternative is always better: running this query “properly” (i.e. set-based, rather than row-by-row) gave me just over 900 reads, and completed in 29 milliseconds. No contest really.
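
    For reference, the set-based version boils down to a single join and aggregate – something like this (which should return the same 60398 total as the tests above):

    -- Set-based equivalent: one join + aggregate instead of row-by-row counting
    SELECT COUNT(*) AS LineCount
    FROM Sales.SalesOrderHeader AS soh
    JOIN Sales.SalesOrderDetail AS sod ON sod.SalesOrderID = soh.SalesOrderID
    WHERE soh.OnlineOrderFlag = 1;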

    Cheers,

    Dave

  • Preventing identity overflow

    Here’s a stored procedure I put together a week or so ago that I thought others might find useful.

    Facepalm

    We had an issue which should really never be allowed to happen (but as they say, “shit happens”); a rather important table ran out of integer values for the identity column.  Luckily the actual value of the ID column on this table isn’t important because the table only stores the most recent month’s data, and everything before that is archived – so we could simply reseed the identity value back to 0 (which had long since been purged from this table).

    To prevent this in future I put together a pretty simple query to check all identity values, and check them against the maximum value for their given data types. This eventually became a procedure which is now just part of my “utility” database which I deploy to every production server.

    IF OBJECT_ID('dbo.usp_check_max_identity_value') IS NULL
        EXEC('CREATE PROC dbo.usp_check_max_identity_value AS ');
    GO
    ALTER PROC dbo.usp_check_max_identity_value (
        @alert_threshold TINYINT = 60, -- The percentage of the max value over which we will alert.
        @send_alert_as_email BIT = 0, -- Do you want to send any alerts as an email? (Well... do ya, punk?)
        @alert_email_recipients NVARCHAR(MAX) = NULL -- Semicolon-delimited list of email recipients.
    ) AS
    BEGIN
        SET NOCOUNT ON;
        SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    
        DECLARE @srv_name VARCHAR(200),
                @db_name VARCHAR(200),
                @table_name VARCHAR(200),
                @column_name VARCHAR(200),
                @data_type VARCHAR(200),
                @max_value BIGINT,
                @max_tinyint TINYINT,
                @max_smallint SMALLINT,
                @max_int INT,
                @max_bigint BIGINT,
                @cmd NVARCHAR(4000),
                @percent_of_max INT,
                @alert_subject NVARCHAR(255),
                @alert_body NVARCHAR(MAX),
                @alert_counter INT;
    
        DECLARE @tt_alert_data TABLE (
            row_id INT IDENTITY,
            [db_name] VARCHAR(200),
            table_name VARCHAR(200),
            column_name VARCHAR(200),
            data_type VARCHAR(200),
            percent_of_max INT
        );
    
        -- These are the max values of each integer data type
        SELECT @max_tinyint = 255,
               @max_smallint = 32767,
               @max_int = 2147483647,
               @max_bigint = 9223372036854775807;
    
        -- Get a list of all available user databases to loop through
        DECLARE db_cursor CURSOR LOCAL STATIC FORWARD_ONLY READ_ONLY FOR
        SELECT DISTINCT
               QUOTENAME(name)
        FROM sys.databases
        WHERE database_id > 4
        AND is_read_only = 0
        AND state_desc = 'ONLINE';
    
        OPEN db_cursor;
        FETCH NEXT FROM db_cursor INTO @db_name;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Get a list of all user tables in the current database
            SET @cmd = N'
                DECLARE tables_cursor CURSOR GLOBAL STATIC FORWARD_ONLY READ_ONLY FOR
                SELECT DISTINCT
                       QUOTENAME(SCHEMA_NAME(t.[schema_id])) + ''.'' + QUOTENAME(t.name) AS table_name,
                       QUOTENAME(c.name) AS column_name,
                       typ.name AS data_type
                FROM ' + @db_name + '.sys.tables t
                JOIN ' + @db_name + '.sys.columns c ON t.[object_id] = c.[object_id]
                JOIN ' + @db_name + '.sys.types typ ON c.system_type_id = typ.system_type_id
                WHERE t.[type] = ''U''
                AND t.is_ms_shipped = 0
                AND c.is_identity = 1
                AND typ.name IN (''tinyint'',''smallint'',''int'',''bigint'')
                ORDER BY QUOTENAME(SCHEMA_NAME(t.[schema_id])) + ''.'' + QUOTENAME(t.name);
            ';
    
            EXEC sp_executesql @cmd, N'';
    
            -- Loop through the list of tables
            OPEN tables_cursor;
            FETCH NEXT FROM tables_cursor INTO @table_name, @column_name, @data_type;
            WHILE @@FETCH_STATUS = 0
            BEGIN
                SELECT @max_value = NULL, @percent_of_max = NULL;
                SET @cmd = N'SELECT @max = MAX(' + @column_name + ') FROM ' + @db_name + '.' + @table_name + ' WITH (NOLOCK);';
    
                -- Run the SELECT to fetch the maximum value from the IDENTITY column
                EXEC sp_executesql @cmd,
                    N'@max BIGINT OUTPUT',
                    @max = @max_value OUTPUT;
    
                IF @max_value IS NULL GOTO FetchNext;
    
                -- Work out the percentage this represents of the maximum possible value
                SELECT @percent_of_max = CAST(CASE
                    WHEN @data_type = 'tinyint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_tinyint AS FLOAT)) * 100.0)
                    WHEN @data_type = 'smallint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_smallint AS FLOAT)) * 100.0)
                    WHEN @data_type = 'int' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_int AS FLOAT)) * 100.0)
                    WHEN @data_type = 'bigint' THEN ((CAST(@max_value AS FLOAT) / CAST(@max_bigint AS FLOAT)) * 100.0)
                    END AS INT);
    
                -- If the percentage is greater than our threshold, raise an alert
                IF (@percent_of_max > @alert_threshold)
                BEGIN
                    -- Insert the values for use later (if we're going to send an email)
                    IF (@send_alert_as_email = 1)
                    BEGIN
                        INSERT @tt_alert_data ([db_name], table_name, column_name, data_type, percent_of_max)
                        VALUES (@db_name, @table_name, @column_name, @data_type, @percent_of_max);
                    END
    
                    RAISERROR(
                        '%s - Identity column %s on table %s is currently at %i%% of an %s''s maximum value.',
                        10,
                        1,
                        @db_name,
                        @column_name,
                        @table_name,
                        @percent_of_max,
                        @data_type
                    ) WITH LOG;
    
                    SET @alert_counter = ISNULL(@alert_counter,0) + 1;
                END
    
                FetchNext:
                FETCH NEXT FROM tables_cursor INTO @table_name, @column_name, @data_type;
            END
    
            CLOSE tables_cursor;
            DEALLOCATE tables_cursor;
    
            FETCH NEXT FROM db_cursor INTO @db_name;
        END
    
        CLOSE db_cursor;
        DEALLOCATE db_cursor;
    
        -- If we want to send an alert email (and there is one to send), then let's do that...
        IF (@send_alert_as_email = 1) AND (ISNULL(@alert_counter, 0) > 0)
        BEGIN
            -- If we haven't provided an email recipient, just carry on regardless
            IF @alert_email_recipients IS NULL RETURN;
    
            SET @srv_name = QUOTENAME(CAST(@@SERVERNAME AS VARCHAR(200)));
            SET @alert_subject = @srv_name + ' - Identity value overflow warning - ' + CAST(@alert_counter AS VARCHAR) + ' columns found.';
    
            -- Build the email body (HTML tags for the benefit of making the email pretty, but not necessary)
            SET @alert_body = '<html><body><style type="text/css"> h3, p, table {font-family:verdana;}</style>';
            SET @alert_body = @alert_body +
                N'<H3>Identity value overflow warning</H3>' +
                N'<p>This alert means that the following tables on the server ' + @srv_name
                + ' have integer-based IDENTITY column values which are currently greater than <strong>'
                + CAST(@alert_threshold AS VARCHAR) + '%</strong> of their respective data-types'' maximum possible values.</p>' +
                N'<p>This is just a warning, but you may want to consider increasing the size of the data-type you use (if possible) '
                + 'or reseeding the table (if the IDENTITY value itself is not important, other than it being unique).</p>' +
                N'<table border="1" cellspacing="0" cellpadding="3">' +
                N'<tr style="background-color:#c9cfff;"><th>Database</th><th>Table</th><th>Column</th><th>Type</th><th>%</th></tr>' +
                CAST(( SELECT td = [db_name], '',
                              td = [table_name], '',
                              td = [column_name], '',
                              td = [data_type], '',
                              td = [percent_of_max]
                       FROM @tt_alert_data
                       ORDER BY row_id
                       FOR XML PATH('tr'), TYPE
                     ) AS NVARCHAR(MAX)) +
                N'</table></body></html>';
    
            -- Send the email (NB: This assumes that you have a default Database Mail profile configured)
            EXEC msdb.dbo.sp_send_dbmail
                @recipients = @alert_email_recipients,
                @subject = @alert_subject,
                @body = @alert_body,
                @body_format = 'HTML',
                @importance = 'High';
        END
    END
    GO
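
    For example, once it’s deployed to your utility database, a scheduled job could call it along these lines (the threshold and recipient address below are just placeholders – use whatever suits you):

    -- Example call: warn at 80% of each data type's maximum, and email the results
    -- (assumes a default Database Mail profile is configured)
    EXEC dbo.usp_check_max_identity_value
        @alert_threshold = 80,
        @send_alert_as_email = 1,
        @alert_email_recipients = 'dba-team@example.com';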

    It’s not very complex – it iterates through all user databases on the server, and then over all user tables that have an identity column, and simply selects the maximum values from each of those identity columns.  This max value is then compared to the max value for the respective data type of the column (i.e. tinyint, smallint, int, or bigint).

    What it does after that is up to you. In this case I send an email with a nicely formatted HTML email displaying the offending columns.  Obviously you can customise what action to take, and what your threshold is. And yes, the title of this post is a little misleading since this doesn’t actually prevent anything, but rather alerts you to impending doom.

    Warning: Before running this on a production system, make sure that you’re happy for it to be selecting from all of your (potentially large and/or busy) tables that have identity columns.  If you have parallel test/dev systems, then run them there instead to prevent any negative performance impact in production. I also take no responsibility for bugs or unexpected behaviour. Consult your GP before taking.

    Cheers

    DB Dave

  • SQL Server word-level bigram function

    I work for an online auction website, Trade Me, and as such collect quite a large quantity of search data (i.e. full text searches that people perform).  Over the years I’ve been involved in various aspects of our search; from tweaking the nuts and bolts of the full-text search itself (see my blog posts about boolean search operators for example) to working on our recommendation algorithm, and implementing search auto-suggest.

    Something I’m looking into at the moment is changing the way we recommend alternative search terms when we think you’ve misspelt something – i.e. our “did-you-mean” system.

    Did you mean…?

    The way we currently do it is based on taking the individual words entered, and generating a large number of alternative spellings for each of them, and then comparing those lists of words against a “known-good” list of search terms and returning the most likely one for each word entered (according to a magic score based on frequency, average results per search, etc).  This is done on the fly and is surprisingly quick (it’s all done within TSQL functions and procedures), but I want to experiment with word level n-grams (unigrams, bigrams, and trigrams specifically).

    Basically, this involves taking each search string and splitting it into individual words to give you your unigrams, and then pairing them up (in the case of bigrams).

    For example, the string “pink ipod nano” would be returned as:

    Word 1    Word 2
    ------    ------
    #         pink
    pink      ipod
    ipod      nano
    nano      #

    The hash symbols are delimiters and are sometimes not used, in which case the 1st and last rows of this table wouldn’t be returned. In my function below, I use a less common symbol (the “Eszett”, ß) as my delimiter, and have a parameter to allow it to be omitted if necessary.

    What this allows you to do is relatively easily store a large set of known n-grams and their occurrence frequencies (i.e. how common they are), and then predict either what word the user is likely to type next (in the case of an auto-suggestion system), or what the user actually meant to type in the case of a misspelling.

    For example, if a user types “ipod” followed by a space into the search box, then given a large enough working dataset you should be able to predict what the next word is likely to be (even more so as they start typing the first few letters).  Or, if the same user types in “ipod nqno“, then we should be able to work out that firstly “nqno” isn’t a common word according to our list of unigrams, and secondly that the word “ipod” followed by a space and “n” is most likely meant to be “ipod nano” – and we can then recommend that as an alternative search.
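
    To make that concrete: if the bigrams and their frequencies were stored in a table (the schema here is hypothetical – this isn’t our production setup), the “what usually follows ipod” lookup is a trivial query:

    -- Hypothetical bigram frequency table: predict the next word after 'ipod'
    SELECT TOP (5)
           word2,
           frequency
    FROM dbo.search_bigrams
    WHERE word1 = N'ipod'
    AND word2 LIKE N'n%' -- narrow down further as the user types an 'n'
    ORDER BY frequency DESC;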

    Obviously this is very high level, and there are many, many other factors and variables to take into account (which is what makes it interesting!) – but it gives you a little background to the original point of this blog post; sharing my bigram function. 🙂

    The guts

    It’s relatively self explanatory, and would likely be rewritten as a CLR function before going into production, but having a TSQL function allows me to test my theory myself before asking a developer to spend any time on it.

    -- =============================================
    -- Author:		David Curlewis
    -- Create date: 2011-09-08
    -- Description:	Splits a string and returns word-level bigrams (i.e. all pairs of words making up the string)
    -- =============================================
    CREATE FUNCTION dbo.fn_word_bigrams (
        @string             NVARCHAR(1000),
        @include_delimiters BIT = 1 -- whether to include rows for prefix and suffix delimiters
    )
    RETURNS @bigrams TABLE (
        row_id  INT, 
        word1   NVARCHAR(100),
        word2   NVARCHAR(100) 
    )
    AS
    BEGIN
        SET @string = LTRIM(RTRIM(@string))
            + CASE @include_delimiters WHEN 1 THEN ' ß' ELSE '' END;
    
        WITH split_words AS
        (
            SELECT  ROW_NUMBER() OVER(ORDER BY @@SPID) AS row_id,
                    item AS word
            FROM    dbo.SplitString_Multi(@string, ' ')
        )
        , bigrams AS 
        (
            SELECT  row_id,
                    CAST('ß' AS NVARCHAR(4000)) AS word1, -- ß is our delimiter
                    CAST(word AS NVARCHAR(4000)) AS word2
            FROM    split_words
            WHERE   row_id = 1
            UNION ALL
            SELECT  b.row_id + 1 AS row_id,
                    CAST(b.word2 AS NVARCHAR(4000)) AS word1, 
                    CAST(w.word AS NVARCHAR(4000)) AS word2
            FROM    split_words AS w
            JOIN    bigrams AS b ON w.row_id = b.row_id + 1
        )
        INSERT  @bigrams
        SELECT  row_id,
                word1,
                word2
        FROM    bigrams
        WHERE   row_id > CASE @include_delimiters WHEN 1 THEN 0 ELSE 1 END;
        
        RETURN; 
    END
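
    Assuming you have a string-splitting function called dbo.SplitString_Multi (mine isn’t included here), calling it looks like this – and should return the same four pairs as the table above, just with ß as the delimiter instead of #:

    -- Example call, with the delimiter rows included
    SELECT row_id, word1, word2
    FROM dbo.fn_word_bigrams(N'pink ipod nano', 1);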

    Shout if you have any questions or comments.

    Cheers,

    DB Dave

  • Full-Text Search Operators – Part 4: Wrapping it up

    "iPod nano 4Gb wrapping" by Darkmere
    That’s a wrap!

    Part 1: Cleaning up parentheses in a string
    Part 2: Replacing alternate operators
    Part 3: Splitting the search string

    Gee, it’s been over a month since my last post already! Since then we’ve been crazy-busy at work, but the good news is that we have a few promising DBA candidates to interview over the next week or two.  I’ve also spoken at the Christchurch and Auckland SQL Server user groups, which were both awesome!

    Anyway, to the point of this post. This will wrap up the series on my solution to the challenge of creating a T-SQL-based Google-like search term handler for SQL full-text. In this exciting episode I’ll run through how I handle quotes (single & double quote pairs), the “near” operator, and then the building of the final full-text command.

    NEAR enough…

    “NEAR” is a full-text proximity operator. This sounds really straightforward, but the catch here is that it actually behaves exactly like the default “AND” Boolean operator unless you’re filtering based on the full-text rank value returned by CONTAINS or CONTAINSTABLE.

    In other words, if you search for “dog NEAR cat”, then all documents returned must have both “dog” and “cat” in them.  Where NEAR differs from AND is that it awards a higher score to documents where “dog” and “cat” are 50 characters or less apart (with a higher score the closer the terms are to each other, or 0 if they are more than 50 characters apart).
    So in order for NEAR to actually limit your search results, you’ll need to add a WHERE clause to your query, limiting results to those with ranks greater than 0. Read more here.
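
    For example (the table and column names here are made up), something along these lines only returns documents where the terms actually scored as being “near” each other:

    -- Hypothetical example: filter out rows where NEAR scored 0
    -- (i.e. the words are more than ~50 characters apart)
    SELECT d.doc_id, d.title, ft.[RANK]
    FROM dbo.Documents AS d
    JOIN CONTAINSTABLE(dbo.Documents, doc_body, 'dog NEAR cat') AS ft ON ft.[KEY] = d.doc_id
    WHERE ft.[RANK] > 0
    ORDER BY ft.[RANK] DESC;

    Anyway, here’s the part of my procedure that handles the NEAR operator: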

    -------------------------------------------------------------------------------------------------
    -- "NEAR" OPERATOR HANDLER
    -------------------------------------------------------------------------------------------------
    IF EXISTS(SELECT * FROM @tt_searchwords WHERE REPLACE(operator, '`', '') = 'op_near')
    BEGIN
        -- reinitialise
        SELECT @i = NULL, @current_word = '', @o = NULL, @oo = NULL;
    
        -- a cursor to loop through each NEAR operator row (yes, I'm getting lazy here...)
        DECLARE near_string CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY FOR
        SELECT rowid
        FROM @tt_searchwords
        WHERE REPLACE(operator, '`', '') = 'op_near';
    
        OPEN near_string;
        FETCH NEXT FROM near_string INTO @i;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- get the rowid of the value before and after the NEAR operator row
            SELECT @o = MAX(rowid)
            FROM @tt_searchwords
            WHERE rowid < @i
            AND operator NOT LIKE 'paren%';
    
            SELECT @oo = MIN(rowid)
            FROM @tt_searchwords
            WHERE rowid > @i
            AND operator NOT LIKE 'paren%';
    
            -- if the NEAR operator (~) is the first or last word in the search string, it's not a valid option
            IF (@o IS NULL) OR (@oo IS NULL)
                DELETE @tt_searchwords
                WHERE rowid = @i;
            ELSE BEGIN
                -- since NEAR requires 2 words (one on each side), we get the word before and after the ~ operator
                SELECT @current_word = '(' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END
                FROM @tt_searchwords
                WHERE rowid = @o;
    
                SELECT @current_word = @current_word + ' NEAR ' + CASE WHEN PATINDEX('%[()!,]%', word) > 0 THEN '"' + word + '"' ELSE word END + ')'
                FROM @tt_searchwords
                WHERE rowid = @oo;
    
                -- update the NEAR operator record with the concatenated search term
                UPDATE @tt_searchwords
                SET word = @current_word,
                    operator = 'searchnear'
                WHERE rowid = @i;
    
                -- flag the records before and after the NEAR operator for deletion below
                UPDATE @tt_searchwords
                SET operator = 'searchword_near'
                WHERE rowid IN (@o, @oo);
            END
    
            FETCH NEXT FROM near_string INTO @i;
        END
    
        CLOSE near_string;
        DEALLOCATE near_string;
    
        -- delete search words that are now included in "NEAR phrases"
        DELETE @tt_searchwords
        WHERE operator = 'searchword_near';
    END

    It’s pretty well commented, and isn’t very complicated so I won’t waste any more space here explaining it. If you do have any questions please feel free to leave me a comment though!

    Builderer of awesome

    This next bit basically takes the content of the table variable (which now contains our correctly formatted and tidied up command fragments), and smushes them together.  It’s not pretty, but it works.

    -------------------------------------------------------------------------------------------------
    -- Final command string builderer of awesome
    -------------------------------------------------------------------------------------------------
    -- Need to quote any search terms that contain certain non-alphanumeric characters
    UPDATE @tt_searchwords
    SET word = CASE
            WHEN PATINDEX('%[[[][]]()!,]%', word) > 0 THEN '"' + word + '"'
            WHEN CHARINDEX('[', word) > 0 THEN '"' + word + '"'
            WHEN CHARINDEX(']', word) > 0 THEN '"' + word + '"'
            ELSE word END
    WHERE operator NOT IN ('searchphrase', 'searchnear', 'searchprefix'); -- these operator types are already quoted
    
    -- reinitialise
    SELECT @i = NULL, @ii = NULL, @final_search = '', @final_search_temp = '', @current_word = '';
    
    -- get the range of row IDs that represent valid search terms and operators
    SELECT @i = MIN(rowid),
           @ii = MAX(rowid)
    FROM @tt_searchwords
    WHERE operator LIKE 'search%'
    OR operator LIKE 'paren%';
    
    WHILE @i <= @ii
    BEGIN
        SELECT @final_search_temp =
            CASE REPLACE(operator, '`', '')
                WHEN 'op_not' THEN ' AND NOT '
                WHEN 'op_or' THEN ' OR '
                WHEN 'paren_o' THEN ' AND ('
                WHEN 'paren_x' THEN ')'
                ELSE ' AND '
            END
            + CASE REPLACE(operator, '`', '')
                WHEN 'op_not' THEN '' -- NOT
                WHEN 'op_or' THEN '' -- OR
                WHEN 'paren_o' THEN '' -- (
                WHEN 'paren_x' THEN '' -- )
                WHEN 'searchnear' THEN word -- NEAR
                WHEN 'searchphrase' THEN word -- "search phrase"
                WHEN 'searchexplicit' THEN word -- explicit search
                WHEN 'searchprefix' THEN word -- wildcard
                ELSE REPLACE(@word_syntax, '<PLACEHOLDER/>', word) -- AND (default)
            END
        FROM @tt_searchwords
        WHERE rowid = @i;
    
        -- CONTAINSTABLE only accepts up to 4000 characters
        IF LEN(CAST(@final_search + @final_search_temp AS VARCHAR(MAX))) > 4000
            BREAK
        ELSE
            SET @final_search = @final_search + @final_search_temp;
    
        IF @i = @ii BREAK;
    
        SELECT @i = MIN(rowid)
        FROM @tt_searchwords
        WHERE rowid > @i;
    END
    
    -- sort out issues of operator double-ups, etc
    -- clunky but it works
    SET @final_search = REPLACE(@final_search, ' AND NOT AND ', ' AND NOT ');
    SET @final_search = REPLACE(@final_search, ' OR AND ', ' OR ');
    SET @final_search = '(' + @final_search + ')';
    SET @final_search = REPLACE(@final_search, '( AND NOT ', '(');
    SET @final_search = REPLACE(@final_search, ' AND NOT )', ')');
    SET @final_search = REPLACE(@final_search, '( AND ', '(');
    SET @final_search = REPLACE(@final_search, ' AND )', ')');
    SET @final_search = REPLACE(@final_search, '( OR ', '(');
    SET @final_search = REPLACE(@final_search, ' OR )', ')');
    SET @final_search = SUBSTRING(@final_search, 2, LEN(@final_search)-2);

    That’s it – I’ve left out some of the fluff in-between the various blocks of code, but have attached the full stored proc here for you to download and tinker with. All I ask is that if you make it better, please let me know so I can improve on this. It could definitely use some love to make it prettier and more efficient, although if pure performance is your goal you should probably investigate a CLR solution instead.

    Cheers
    DB Dave