即日起在codingBlog上分享您的技术经验即可获得积分,积分可兑换现金哦。

实施Levenshtein距离MySQL /模糊搜索?

栈溢出 Nathan Koop 255℃ 0评论
本文目录
[隐藏]

1.原始问题:Implementation of Levenshtein distance for mysql/fuzzy search?

I would like to be able to search a table as follows for smith as get everything that it within 1 variance.

Data:

O'Brien
Smithe
Dolan
Smuth
Wong
Smoth
Gunther
Smiht

I have looked into using Levenshtein distance does anyone know how to implement this with it?

2.被采纳答案

Does this help? MySQL Levenshtein distance query

EDIT: The old link Levenshtein Distance as a MySQL stored function (Google Cache) is broken, thanks to Robert for pointing this out in the comment.

3.其他高票答案

3.1.第1个答案

In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you’re looking for full-text search, instead of just a single term per row. Off-hand, I can’t think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.

3.2.第2个答案

An implementation for the damerau-levenshtein distance can be found here:
Damerau-Levenshtein algorithm: Levenshtein with transpositions
The improvement over pure Levenshtein distance is that the swapping of characters is considered. I found it in the comments of schnaader’s link, thanks!

3.3.第3个答案

There is a mysql UDF implementation of Levenshtein Distance function

https://github.com/jmcejuela/Levenshtein-MySQL-UDF

It is implemented in C and has better performance than the “MySQL Levenshtein distance query” mentioned by schnaader

3.4.第4个答案

The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".

I modified the “MySQL Levenshtein distance query” given above, in the first answer, to accept a “limit” that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.

This mod makes it 15% to 50% faster — the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word “giggle,” the original takes 3 min 47 sec on my laptop, versus 1:39 for the “limit” version. Of course, these are both too slow for any real-time use.

Code:

DELIMITER $$
CREATE FUNCTION levenshtein_limit_n( s1 VARCHAR(255), s2 VARCHAR(255), n INT) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost, c_min INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0, c_min = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len and c_min < n DO -- if actual levenshtein dist >= limit, don't bother computing it
        SET s1_char = SUBSTRING(s1, i, 1), c = i, c_min = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            IF c < c_min THEN
              SET c_min = c;
            END IF; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF;
    IF i <= s1_len THEN -- we didn't finish, limit exceeded    
      SET c = c_min; -- actual distance is >= c_min (i.e., the smallest value in the last computed row of the matrix) 
    END IF;
    RETURN c;
  END$$

转载请注明:CodingBlog » 实施Levenshtein距离MySQL /模糊搜索?

喜欢 (0)or分享 (0)
发表我的评论
取消评论

*

表情