This post isn’t related to the series I’m currently writing on Full-Text search operators, but it is in the same ball-park so I thought I’d throw it up here.
Today I was working on our search term auto-suggest system (i.e. the system that shows a list of suggestions in a drop-down list below the search box as you type your search term in), and needed to improve the way we strip out “illegal” characters from our dictionary terms.
Illegal characters in our case refers to pretty much anything that isn’t alphanumeric, or a space, dollar sign, ampersand, hyphen, or period. We don’t even want any accented characters (i.e. letters with diacritics) – although in this case we want to replace accented characters with their un-accented counterparts (e.g. Å should be replaced by A).
The way I did this (and I’m sure there are many, many better ways) is to insert illegal characters into a table variable, along with a replacement character, and then update the input variable for each row in this table variable – replacing any matching illegal characters each time with it’s replacement value.
IF OBJECT_ID('[dbo].[fn_clean_search_term]') IS NULL EXEC('CREATE FUNCTION [dbo].[fn_clean_search_term] () RETURNS INT AS BEGIN RETURN(0); END '); GO /*************************************************************************************************************************** This accepts a string input, and outputs the same string, but with any invalid characters removed, or replaced if they are accented characters. Note: not all accented characters are accounted for, only the most common ones in our data. Also, this does not handle Unicode characters. Author : David Curlewis Date : 07/2009 This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/ ****************************************************************************************************************************/ ALTER FUNCTION [dbo].[fn_clean_search_term] ( @keyword VARCHAR(2048) ) RETURNS VARCHAR(2048) WITH SCHEMABINDING AS BEGIN -- Only do the heavy lifting if there are non-alphanumeric characters in the string IF PATINDEX('%[^A-Za-z0-9 ]%', @keyword) > 0 BEGIN -- if the string contains anything other than "valid" characters, strip them out DECLARE @chars TABLE ([ascii] TINYINT NOT NULL, [new_char] VARCHAR(1)); ;WITH illegal_chars AS ( SELECT 0 AS [ascii] UNION ALL SELECT [ascii] + 1 FROM illegal_chars WHERE [ascii] < 255 ) INSERT @chars ([ascii], [new_char]) SELECT [ascii], CASE -- Replace accented letters with non-accented letters to normalise the search terms -- Unless your collation is case sensitive, the lower & upper case variants are equivalent WHEN CHAR([ascii]) IN ('á','à','â','ä','å') THEN 'a' WHEN CHAR([ascii]) IN ('Á','Ã','Ä','Å','À','Â') THEN 'A' WHEN CHAR([ascii]) IN ('ç') THEN 'c' WHEN CHAR([ascii]) IN ('Ç') THEN 'C' WHEN CHAR([ascii]) IN ('ê','é','ë','è') THEN 'e' WHEN CHAR([ascii]) IN ('Ê','Ë','É','È') THEN 'E' WHEN CHAR([ascii]) IN ('ï','í','î','ì') THEN 'i' WHEN CHAR([ascii]) IN ('Í','Ì','Î','Ï') THEN 'I' WHEN CHAR([ascii]) IN ('ñ') THEN 'n' WHEN CHAR([ascii]) IN ('Ñ') THEN 'N' WHEN CHAR([ascii]) IN ('ô','ö','ò','õ','ó','ø') THEN 'o' WHEN CHAR([ascii]) IN ('Ó','Ô','Õ','Ø','Ö','Ò') THEN 'O' WHEN CHAR([ascii]) IN ('š') THEN 's' WHEN CHAR([ascii]) IN ('Š') THEN 'S' WHEN CHAR([ascii]) IN ('ú','ü','û','ù') THEN 'u' WHEN CHAR([ascii]) IN ('Ù','Ú','Ü','Û') THEN 'U' WHEN CHAR([ascii]) IN ('ÿ','ý') THEN 'y' WHEN CHAR([ascii]) IN ('Ÿ','Ý') THEN 'Y' WHEN CHAR([ascii]) IN ('ž') THEN 'z' WHEN CHAR([ascii]) IN ('Ž') THEN 'Z' ELSE '' -- remove the character END AS [new_char] FROM illegal_chars WHERE [ascii] NOT BETWEEN 48 AND 57 -- 0 -> 9 AND [ascii] NOT BETWEEN 65 AND 90 -- A -> Z AND [ascii] NOT BETWEEN 97 AND 122 -- a -> z AND [ascii] NOT IN ( -- explicitly allow the following characters 32, -- <space> 36, -- $ 38, -- & 45, -- - 46 -- . ) OPTION (MAXRECURSION 255); UPDATE @chars SET @keyword = REPLACE(@keyword, CHAR([ascii]), [new_char]) END -- remove multiple spaces WHILE CHARINDEX(' ', @keyword) > 0 SELECT @keyword = REPLACE(@keyword, ' ', ' '); SELECT @keyword = LTRIM(RTRIM(@keyword)); -- trim leading & trailing whitespace SELECT @keyword = LOWER(@keyword); -- return lower-case only RETURN @keyword; END GO
Let me know if you have any questions, or different/better ways of doing this (yes, I know I could do it using a couple of lines of .Net in a CLR function, but I don’t have CLR enabled on this server).
Cheers
DB Dave
Thank you for a great post.