Removing accented and illegal characters from a string

This post isn’t related to the series I’m currently writing on Full-Text search operators, but it is in the same ball-park so I thought I’d throw it up here.

Today I was working on our search term auto-suggest system (i.e. the system that shows a list of suggestions in a drop-down list below the search box as you type your search term in), and needed to improve the way we strip out “illegal” characters from our dictionary terms.

Illegal characters in our case refers to pretty much anything that isn’t alphanumeric, or a space, dollar sign, ampersand, hyphen, or period.  We don’t even want any accented characters (i.e. letters with diacritics) – although in this case we want to replace accented characters with their un-accented counterparts (e.g. Å should be replaced by A).

The way I did this (and I’m sure there are many, many better ways) is to insert illegal characters into a table variable, along with a replacement character, and then update the input variable for each row in this table variable – replacing any matching illegal characters each time with it’s replacement value.

IF OBJECT_ID('[dbo].[fn_clean_search_term]') IS NULL
	EXEC('CREATE FUNCTION [dbo].[fn_clean_search_term] () RETURNS INT AS BEGIN RETURN(0); END ');
GO
/***************************************************************************************************************************
This accepts a string input, and outputs the same string, but with any invalid characters removed, or replaced if they
are accented characters. Note: not all accented characters are accounted for, only the most common ones in our data.
Also, this does not handle Unicode characters.

Author : David Curlewis
Date : 07/2009

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License
http://creativecommons.org/licenses/by-nc/3.0/
****************************************************************************************************************************/
ALTER FUNCTION [dbo].[fn_clean_search_term] (
	@keyword VARCHAR(2048)
) RETURNS VARCHAR(2048) WITH SCHEMABINDING
AS
BEGIN

	-- Only do the heavy lifting if there are non-alphanumeric characters in the string
	IF PATINDEX('%[^A-Za-z0-9 ]%', @keyword) > 0
	BEGIN

		-- if the string contains anything other than "valid" characters, strip them out
		DECLARE @chars TABLE ([ascii] TINYINT NOT NULL, [new_char] VARCHAR(1));

		;WITH illegal_chars AS
		(
			SELECT	0 AS [ascii]
			UNION ALL
			SELECT	[ascii] + 1
			FROM	illegal_chars
			WHERE	[ascii] < 255
		)
		INSERT	@chars ([ascii], [new_char])
		SELECT	[ascii],
				CASE -- Replace accented letters with non-accented letters to normalise the search terms
				-- Unless your collation is case sensitive, the lower &amp; upper case variants are equivalent
				WHEN CHAR([ascii]) IN ('á','à','â','ä','å') THEN 'a'
				WHEN CHAR([ascii]) IN ('Á','Ã','Ä','Å','À','Â') THEN 'A'
				WHEN CHAR([ascii]) IN ('ç') THEN 'c'
				WHEN CHAR([ascii]) IN ('Ç') THEN 'C'
				WHEN CHAR([ascii]) IN ('ê','é','ë','è') THEN 'e'
				WHEN CHAR([ascii]) IN ('Ê','Ë','É','È') THEN 'E'
				WHEN CHAR([ascii]) IN ('ï','í','î','ì') THEN 'i'
				WHEN CHAR([ascii]) IN ('Í','Ì','Î','Ï') THEN 'I'
				WHEN CHAR([ascii]) IN ('ñ') THEN 'n'
				WHEN CHAR([ascii]) IN ('Ñ') THEN 'N'
				WHEN CHAR([ascii]) IN ('ô','ö','ò','õ','ó','ø') THEN 'o'
				WHEN CHAR([ascii]) IN ('Ó','Ô','Õ','Ø','Ö','Ò') THEN 'O'
				WHEN CHAR([ascii]) IN ('š') THEN 's'
				WHEN CHAR([ascii]) IN ('Š') THEN 'S'
				WHEN CHAR([ascii]) IN ('ú','ü','û','ù') THEN 'u'
				WHEN CHAR([ascii]) IN ('Ù','Ú','Ü','Û') THEN 'U'
				WHEN CHAR([ascii]) IN ('ÿ','ý') THEN 'y'
				WHEN CHAR([ascii]) IN ('Ÿ','Ý') THEN 'Y'
				WHEN CHAR([ascii]) IN ('ž') THEN 'z'
				WHEN CHAR([ascii]) IN ('Ž') THEN 'Z'
				ELSE '' -- remove the character
				END AS [new_char]
		FROM	illegal_chars
		WHERE	[ascii] NOT BETWEEN 48 AND 57 -- 0 -> 9
		AND		[ascii] NOT BETWEEN 65 AND 90 -- A -> Z
		AND		[ascii] NOT BETWEEN 97 AND 122 -- a -> z
		AND		[ascii] NOT IN ( -- explicitly allow the following characters
					32, -- <space>
					36, -- $
					38, -- &amp;
					45, -- -
					46 -- .
					)
		OPTION (MAXRECURSION 255);

		UPDATE @chars
		SET @keyword = REPLACE(@keyword, CHAR([ascii]), [new_char])

	END

	-- remove multiple spaces
	WHILE CHARINDEX(' ', @keyword) > 0
	SELECT @keyword = REPLACE(@keyword, ' ', ' ');

	SELECT @keyword = LTRIM(RTRIM(@keyword)); -- trim leading &amp; trailing whitespace
	SELECT @keyword = LOWER(@keyword); -- return lower-case only

	RETURN @keyword;
END
GO

Let me know if you have any questions, or different/better ways of doing this (yes, I know I could do it using a couple of lines of .Net in a CLR function, but I don’t have CLR enabled on this server). Smile

Cheers

DB Dave

1 thought on “Removing accented and illegal characters from a string”

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top