1.原始问题：Implementation of Levenshtein distance for mysql/fuzzy search?
I would like to be able to search a table as follows for smith as get everything that it within 1 variance.
O'Brien Smithe Dolan Smuth Wong Smoth Gunther Smiht
I have looked into using Levenshtein distance does anyone know how to implement this with it?
Does this help? MySQL Levenshtein distance query
EDIT: The old link Levenshtein Distance as a MySQL stored function (Google Cache) is broken, thanks to Robert for pointing this out in the comment.
In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you’re looking for full-text search, instead of just a single term per row. Off-hand, I can’t think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.
An implementation for the damerau-levenshtein distance can be found here:
Damerau-Levenshtein algorithm: Levenshtein with transpositions
The improvement over pure Levenshtein distance is that the swapping of characters is considered. I found it in the comments of schnaader’s link, thanks!
There is a mysql UDF implementation of Levenshtein Distance function
It is implemented in C and has better performance than the “MySQL Levenshtein distance query” mentioned by schnaader
The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".
I modified the “MySQL Levenshtein distance query” given above, in the first answer, to accept a “limit” that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.
This mod makes it 15% to 50% faster — the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word “giggle,” the original takes 3 min 47 sec on my laptop, versus 1:39 for the “limit” version. Of course, these are both too slow for any real-time use.
DELIMITER $$ CREATE FUNCTION levenshtein_limit_n( s1 VARCHAR(255), s2 VARCHAR(255), n INT) RETURNS INT DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp, cost, c_min INT; DECLARE s1_char CHAR; -- max strlen=255 DECLARE cv0, cv1 VARBINARY(256); SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0, c_min = 0; IF s1 = s2 THEN RETURN 0; ELSEIF s1_len = 0 THEN RETURN s2_len; ELSEIF s2_len = 0 THEN RETURN s1_len; ELSE WHILE j <= s2_len DO SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; END WHILE; WHILE i <= s1_len and c_min < n DO -- if actual levenshtein dist >= limit, don't bother computing it SET s1_char = SUBSTRING(s1, i, 1), c = i, c_min = i, cv0 = UNHEX(HEX(i)), j = 1; WHILE j <= s2_len DO SET c = c + 1; IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; IF c > c_temp THEN SET c = c_temp; END IF; SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; IF c > c_temp THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; IF c < c_min THEN SET c_min = c; END IF; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; END IF; IF i <= s1_len THEN -- we didn't finish, limit exceeded SET c = c_min; -- actual distance is >= c_min (i.e., the smallest value in the last computed row of the matrix) END IF; RETURN c; END$$