This post isn’t related to the series I’m currently writing on Full-Text search operators, but it is in the same ball-park so I thought I’d throw it up here.
Today I was working on our search term auto-suggest system (i.e. the system that shows a list of suggestions in a drop-down list below the search box as you type your search term in), and needed to improve the way we strip out “illegal” characters from our dictionary terms.
Illegal characters in our case refers to pretty much anything that isn’t alphanumeric, or a space, dollar sign, ampersand, hyphen, or period. We don’t even want any accented characters (i.e. letters with diacritics) – although in this case we want to replace accented characters with their un-accented counterparts (e.g. Å should be replaced by A).
The way I did this (and I’m sure there are many, many better ways) is to insert illegal characters into a table variable, along with a replacement character, and then update the input variable for each row in this table variable – replacing any matching illegal characters each time with it’s replacement value.
IF OBJECT_ID('[dbo].[fn_clean_search_term]') IS NULL
EXEC('CREATE FUNCTION [dbo].[fn_clean_search_term] () RETURNS INT AS BEGIN RETURN(0); END ');
GO
/***************************************************************************************************************************
This accepts a string input, and outputs the same string, but with any invalid characters removed, or replaced if they
are accented characters. Note: not all accented characters are accounted for, only the most common ones in our data.
Also, this does not handle Unicode characters.
Author : David Curlewis
Date : 07/2009
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
http://creativecommons.org/licenses/by-nc/3.0/
****************************************************************************************************************************/
ALTER FUNCTION [dbo].[fn_clean_search_term] (
@keyword VARCHAR(2048)
) RETURNS VARCHAR(2048) WITH SCHEMABINDING
AS
BEGIN
-- Only do the heavy lifting if there are non-alphanumeric characters in the string
IF PATINDEX('%[^A-Za-z0-9 ]%', @keyword) > 0
BEGIN
-- if the string contains anything other than "valid" characters, strip them out
DECLARE @chars TABLE ([ascii] TINYINT NOT NULL, [new_char] VARCHAR(1));
;WITH illegal_chars AS
(
SELECT 0 AS [ascii]
UNION ALL
SELECT [ascii] + 1
FROM illegal_chars
WHERE [ascii] < 255
)
INSERT @chars ([ascii], [new_char])
SELECT [ascii],
CASE -- Replace accented letters with non-accented letters to normalise the search terms
-- Unless your collation is case sensitive, the lower & upper case variants are equivalent
WHEN CHAR([ascii]) IN ('á','à','â','ä','å') THEN 'a'
WHEN CHAR([ascii]) IN ('Á','Ã','Ä','Å','À','Â') THEN 'A'
WHEN CHAR([ascii]) IN ('ç') THEN 'c'
WHEN CHAR([ascii]) IN ('Ç') THEN 'C'
WHEN CHAR([ascii]) IN ('ê','é','ë','è') THEN 'e'
WHEN CHAR([ascii]) IN ('Ê','Ë','É','È') THEN 'E'
WHEN CHAR([ascii]) IN ('ï','í','î','ì') THEN 'i'
WHEN CHAR([ascii]) IN ('Í','Ì','Î','Ï') THEN 'I'
WHEN CHAR([ascii]) IN ('ñ') THEN 'n'
WHEN CHAR([ascii]) IN ('Ñ') THEN 'N'
WHEN CHAR([ascii]) IN ('ô','ö','ò','õ','ó','ø') THEN 'o'
WHEN CHAR([ascii]) IN ('Ó','Ô','Õ','Ø','Ö','Ò') THEN 'O'
WHEN CHAR([ascii]) IN ('š') THEN 's'
WHEN CHAR([ascii]) IN ('Š') THEN 'S'
WHEN CHAR([ascii]) IN ('ú','ü','û','ù') THEN 'u'
WHEN CHAR([ascii]) IN ('Ù','Ú','Ü','Û') THEN 'U'
WHEN CHAR([ascii]) IN ('ÿ','ý') THEN 'y'
WHEN CHAR([ascii]) IN ('Ÿ','Ý') THEN 'Y'
WHEN CHAR([ascii]) IN ('ž') THEN 'z'
WHEN CHAR([ascii]) IN ('Ž') THEN 'Z'
ELSE '' -- remove the character
END AS [new_char]
FROM illegal_chars
WHERE [ascii] NOT BETWEEN 48 AND 57 -- 0 -> 9
AND [ascii] NOT BETWEEN 65 AND 90 -- A -> Z
AND [ascii] NOT BETWEEN 97 AND 122 -- a -> z
AND [ascii] NOT IN ( -- explicitly allow the following characters
32, -- <space>
36, -- $
38, -- &
45, -- -
46 -- .
)
OPTION (MAXRECURSION 255);
UPDATE @chars
SET @keyword = REPLACE(@keyword, CHAR([ascii]), [new_char])
END
-- remove multiple spaces
WHILE CHARINDEX(' ', @keyword) > 0
SELECT @keyword = REPLACE(@keyword, ' ', ' ');
SELECT @keyword = LTRIM(RTRIM(@keyword)); -- trim leading & trailing whitespace
SELECT @keyword = LOWER(@keyword); -- return lower-case only
RETURN @keyword;
END
GO
Let me know if you have any questions, or different/better ways of doing this (yes, I know I could do it using a couple of lines of .Net in a CLR function, but I don’t have CLR enabled on this server). ![]()
Cheers
DB Dave
Comments
One response to “Removing accented and illegal characters from a string”
Thank you for a great post.