我正在寻找一种字符串相似度算法,它可以在变长字符串上产生比通常建议的更好的结果(levenshtein距离,soundex等).
例如,
鉴于字符串A:"罗伯特",
然后是字符串B:"Amy Robertson"
会比一个更好的比赛
字符串C:"理查德"
此外,优选地,该算法应该是语言不可知的(也可以用于除英语之外的语言).
Catalysoft的西蒙怀特写了一篇关于一个非常聪明的算法的文章,该算法比较了相邻的字符对,这些字符对非常适用于我的目的:
http://www.catalysoft.com/articles/StrikeAMatch.html
Simon有一个Java版本的算法,下面我写了一个PL/Ruby版本(取自Mark Wong-VanHaren在相关论坛条目评论中完成的普通ruby版本),以便我可以在我的PostgreSQL查询中使用它:
CREATE FUNCTION string_similarity(str1 varchar, str2 varchar) RETURNS float8 AS ' str1.downcase! pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject { |pair| pair.include? " "} str2.downcase! pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject { |pair| pair.include? " "} union = pairs1.size + pairs2.size intersection = 0 pairs1.each do |p1| 0.upto(pairs2.size-1) do |i| if p1 == pairs2[i] intersection += 1 pairs2.slice!(i) break end end end (2.0 * intersection) / union ' LANGUAGE 'plruby';
奇迹般有效!
marzagao的答案很棒.我将它转换为C#所以我想我会在这里发布它:
Pastebin Link
///
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
///
public class SimilarityTool
{
///
/// Compares the two strings based on letter pair matches
///
///
///
/// The percentage match from 0.0 to 1.0 where 1.0 is 100%
public double CompareStrings(string str1, string str2)
{
List pairs1 = WordLetterPairs(str1.ToUpper());
List pairs2 = WordLetterPairs(str2.ToUpper());
int intersection = 0;
int union = pairs1.Count + pairs2.Count;
for (int i = 0; i < pairs1.Count; i++)
{
for (int j = 0; j < pairs2.Count; j++)
{
if (pairs1[i] == pairs2[j])
{
intersection++;
pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
break;
}
}
}
return (2.0 * intersection) / union;
}
///
/// Gets all letter pairs for each
/// individual word in the string
///
///
///
private List WordLetterPairs(string str)
{
List AllPairs = new List();
// Tokenize the string and put the tokens/words into an array
string[] Words = Regex.Split(str, @"\s");
// For each word
for (int w = 0; w < Words.Length; w++)
{
if (!string.IsNullOrEmpty(Words[w]))
{
// Find the pairs of characters
String[] PairsInWord = LetterPairs(Words[w]);
for (int p = 0; p < PairsInWord.Length; p++)
{
AllPairs.Add(PairsInWord[p]);
}
}
}
return AllPairs;
}
///
/// Generates an array containing every
/// two consecutive letters in the input string
///
///
///
private string[] LetterPairs(string str)
{
int numPairs = str.Length - 1;
string[] pairs = new string[numPairs];
for (int i = 0; i < numPairs; i++)
{
pairs[i] = str.Substring(i, 2);
}
return pairs;
}
}
这是marzagao的另一个版本的答案,这是用Python编写的:
def get_bigrams(string):
"""
Take a string and return a list of bigrams.
"""
s = string.lower()
return [s[i:i+2] for i in list(range(len(s) - 1))]
def string_similarity(str1, str2):
"""
Perform bigram comparison between two strings
and return a percentage match in decimal form.
"""
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
union = len(pairs1) + len(pairs2)
hit_count = 0
for x in pairs1:
for y in pairs2:
if x == y:
hit_count += 1
break
return (2.0 * hit_count) / union
if __name__ == "__main__":
"""
Run a test using the example taken from:
http://www.catalysoft.com/articles/StrikeAMatch.html
"""
w1 = 'Healed'
words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']
for w2 in words:
print('Healed --- ' + w2)
print(string_similarity(w1, w2))
print()
这是Simon White建议的StrikeAMatch算法的PHP实现.优点(如链接中所说)是:
词汇相似性的真实反映 - 具有小差异的字符串应该被认为是相似的.特别是,重要的子串重叠应指向字符串之间的高度相似性.
对词序变化的鲁棒性 - 两个字符串包含相同的单词,但顺序不同,应该被认为是相似的.另一方面,如果一个字符串只是另一个字符串中包含的字符的随机字谜,则应该(通常)将其识别为不相似.
语言独立 - 算法不仅应该用于英语,还应该用于许多不同的语言.
letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p++)
{
$allPairs[] = $pairsInWord[$p];
}
}
return $allPairs;
}
/**
* @param $str
* @return array
*/
private function letterPairs($str)
{
$numPairs = mb_strlen($str)-1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i++)
{
$pairs[$i] = mb_substr($str,$i,2);
}
return $pairs;
}
/**
* @param $str1
* @param $str2
* @return float
*/
public function compareStrings($str1, $str2)
{
$pairs1 = $this->wordLetterPairs(strtoupper($str1));
$pairs2 = $this->wordLetterPairs(strtoupper($str2));
$intersection = 0;
$union = count($pairs1) + count($pairs2);
for ($i=0; $i < count($pairs1); $i++)
{
$pair1 = $pairs1[$i];
$pairs2 = array_values($pairs2);
for($j = 0; $j < count($pairs2); $j++)
{
$pair2 = $pairs2[$j];
if ($pair1 === $pair2)
{
$intersection++;
unset($pairs2[$j]);
break;
}
}
}
return (2.0*$intersection)/$union;
}
}
约翰拉特利奇答案的缩短版本:
def get_bigrams(string):
'''
Takes a string and returns a list of bigrams
'''
s = string.lower()
return {s[i:i+2] for i in xrange(len(s) - 1)}
def string_similarity(str1, str2):
'''
Perform bigram comparison between two strings
and return a percentage match in decimal form
'''
pairs1 = get_bigrams(str1)
pairs2 = get_bigrams(str2)
return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))
这个讨论非常有用,谢谢.我将算法转换为VBA以便与Excel一起使用,并编写了几个版本的工作表函数,一个用于简单比较一对字符串,另一个用于比较一个字符串到一个范围/字符串数组.strSimLookup版本返回最后一个最佳匹配字符串,数组索引或相似性度量标准.
此实现产生的结果与Simon White网站上的亚马逊示例中列出的结果相同,但在得分较低的匹配上有一些小的例外情况; 不确定差异在哪里蔓延,可能是VBA的分裂功能,但我没有调查,因为它的工作正常我的目的.
'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source: http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010
Option Explicit
Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Set sPairs1 = New Collection
Set sPairs2 = New Collection
WordLetterPairs str1, sPairs1
WordLetterPairs str2, sPairs2
stringSimilarity = SimilarityMetric(sPairs1, sPairs2)
Set sPairs1 = Nothing
Set sPairs2 = Nothing
End Function
Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim arrOut As Variant
Dim l As Long, j As Long
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
l = rRng.Count
ReDim arrOut(1 To l)
For j = 1 To l
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(j)), sPairs2
arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
Set sPairs2 = Nothing
Next j
strSimA = Application.Transpose(arrOut)
End Function
Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1 : returns the index of the best matching string
' returnType = 2 : returns the similarity metric
Dim sPairs1 As Collection
Dim sPairs2 As Collection
Dim metric, bestMetric As Double
Dim i, iBest As Long
Const RETURN_STRING As Integer = 0
Const RETURN_INDEX As Integer = 1
Const RETURN_METRIC As Integer = 2
If IsMissing(returnType) Then returnType = RETURN_STRING
Set sPairs1 = New Collection
WordLetterPairs CStr(str1), sPairs1
bestMetric = -1
iBest = -1
For i = 1 To rRng.Count
Set sPairs2 = New Collection
WordLetterPairs CStr(rRng(i)), sPairs2
metric = SimilarityMetric(sPairs1, sPairs2)
If metric > bestMetric Then
bestMetric = metric
iBest = i
End If
Set sPairs2 = Nothing
Next i
If iBest = -1 Then
strSimLookup = CVErr(xlErrValue)
Exit Function
End If
Select Case returnType
Case RETURN_STRING
strSimLookup = CStr(rRng(iBest))
Case RETURN_INDEX
strSimLookup = iBest
Case Else
strSimLookup = bestMetric
End Select
End Function
Public Function strSim(str1 As String, str2 As String) As Variant
Dim ilen, iLen1, ilen2 As Integer
iLen1 = Len(str1)
ilen2 = Len(str2)
If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1
strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))
End Function
Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl
Dim Words() As String
Dim word, nPairs, pair As Integer
Words = Split(str)
If UBound(Words) < 0 Then
Set pairColl = Nothing
Exit Sub
End If
For word = 0 To UBound(Words)
nPairs = Len(Words(word)) - 1
If nPairs > 0 Then
For pair = 1 To nPairs
pairColl.Add Mid(Words(word), pair, 2)
Next pair
End If
Next word
End Sub
Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else
Dim Intersect As Double
Dim Union As Double
Dim i, j As Long
If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
SimilarityMetric = CVErr(xlErrNA)
Exit Function
End If
Union = sPairs1.Count + sPairs2.Count
Intersect = 0
For i = 1 To sPairs1.Count
For j = 1 To sPairs2.Count
If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
Intersect = Intersect + 1
sPairs2.Remove j
Exit For
End If
Next j
Next i
SimilarityMetric = (2 * Intersect) / Union
End Function
对不起,答案不是作者发明的.这是众所周知的算法,其首先由Digital Equipment Corporation提出并且通常被称为搭迭.
http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf
我将Simon White的算法翻译成了PL/pgSQL.这是我的贡献.
create or replace function spt1.letterpairs(in p_str varchar) returns varchar as $$ declare v_numpairs integer := length(p_str)-1; v_pairs varchar[]; begin for i in 1 .. v_numpairs loop v_pairs[i] := substr(p_str, i, 2); end loop; return v_pairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.wordletterpairs(in p_str varchar) returns varchar as $$ declare v_allpairs varchar[]; v_words varchar[]; v_pairsinword varchar[]; begin v_words := regexp_split_to_array(p_str, '[[:space:]]'); for i in 1 .. array_length(v_words, 1) loop v_pairsinword := spt1.letterpairs(v_words[i]); if v_pairsinword is not null then for j in 1 .. array_length(v_pairsinword, 1) loop v_allpairs := v_allpairs || v_pairsinword[j]; end loop; end if; end loop; return v_allpairs; end; $$ language 'plpgsql'; --=================================================================== create or replace function spt1.arrayintersect(ANYARRAY, ANYARRAY) returns anyarray as $$ select array(select unnest($1) intersect select unnest($2)) $$ language 'sql'; --=================================================================== create or replace function spt1.comparestrings(in p_str1 varchar, in p_str2 varchar) returns float as $$ declare v_pairs1 varchar[]; v_pairs2 varchar[]; v_intersection integer; v_union integer; begin v_pairs1 := wordletterpairs(upper(p_str1)); v_pairs2 := wordletterpairs(upper(p_str2)); v_union := array_length(v_pairs1, 1) + array_length(v_pairs2, 1); v_intersection := array_length(arrayintersect(v_pairs1, v_pairs2), 1); return (2.0 * v_intersection / v_union); end; $$ language 'plpgsql';
更快的PHP版本的算法:
/**
*
* @param $str
* @return mixed
*/
private static function wordLetterPairs ($str)
{
$allPairs = array();
// Tokenize the string and put the tokens/words into an array
$words = explode(' ', $str);
// For each word
for ($w = 0; $w < count($words); $w ++) {
// Find the pairs of characters
$pairsInWord = self::letterPairs($words[$w]);
for ($p = 0; $p < count($pairsInWord); $p ++) {
$allPairs[$pairsInWord[$p]] = $pairsInWord[$p];
}
}
return array_values($allPairs);
}
/**
*
* @param $str
* @return array
*/
private static function letterPairs ($str)
{
$numPairs = mb_strlen($str) - 1;
$pairs = array();
for ($i = 0; $i < $numPairs; $i ++) {
$pairs[$i] = mb_substr($str, $i, 2);
}
return $pairs;
}
/**
*
* @param $str1
* @param $str2
* @return float
*/
public static function compareStrings ($str1, $str2)
{
$pairs1 = self::wordLetterPairs(mb_strtolower($str1));
$pairs2 = self::wordLetterPairs(mb_strtolower($str2));
$union = count($pairs1) + count($pairs2);
$intersection = count(array_intersect($pairs1, $pairs2));
return (2.0 * $intersection) / $union;
}
对于我的数据(约2300次比较),Igal Alkon解决方案的运行时间为0.58秒,而我的运行时间为0.35 秒.
美丽的Scala中的一个版本:
def pairDistance(s1: String, s2: String): Double = {
def strToPairs(s: String, acc: List[String]): List[String] = {
if (s.size < 2) acc
else strToPairs(s.drop(1),
if (s.take(2).contains(" ")) acc else acc ::: List(s.take(2)))
}
val lst1 = strToPairs(s1.toUpperCase, List())
val lst2 = strToPairs(s2.toUpperCase, List())
(2.0 * lst2.intersect(lst1).size) / (lst1.size + lst2.size)
}
String Similarity Metrics包含字符串比较中使用的许多不同指标的概述(Wikipedia也有概述).这些指标中的大部分都是在库simmetrics中实现的.
另一个度量的示例(未包括在给定概述中)是例如压缩距离(试图近似于Kolmogorov的复杂度),其可用于比您呈现的文本更长的文本.
您也可以考虑查看更广泛的自然语言处理主题.这些 R包可以让您快速入门(或者至少提供一些想法).
最后一个编辑 - 在SO搜索关于这个主题的其他问题,有很多相关的问题.
这是R版本:
get_bigrams <- function(str) { lstr = tolower(str) bigramlst = list() for(i in 1:(nchar(str)-1)) { bigramlst[[i]] = substr(str, i, i+1) } return(bigramlst) } str_similarity <- function(str1, str2) { pairs1 = get_bigrams(str1) pairs2 = get_bigrams(str2) unionlen = length(pairs1) + length(pairs2) hit_count = 0 for(x in 1:length(pairs1)){ for(y in 1:length(pairs2)){ if (pairs1[[x]] == pairs2[[y]]) hit_count = hit_count + 1 } } return ((2.0 * hit_count) / unionlen) }
在这些算法的启发下,在C99中发布marzagao的答案
double dice_match(const char *string1, const char *string2) {
//check fast cases
if (((string1 != NULL) && (string1[0] == '\0')) ||
((string2 != NULL) && (string2[0] == '\0'))) {
return 0;
}
if (string1 == string2) {
return 1;
}
size_t strlen1 = strlen(string1);
size_t strlen2 = strlen(string2);
if (strlen1 < 2 || strlen2 < 2) {
return 0;
}
size_t length1 = strlen1 - 1;
size_t length2 = strlen2 - 1;
double matches = 0;
int i = 0, j = 0;
//get bigrams and compare
while (i < length1 && j < length2) {
char a[3] = {string1[i], string1[i + 1], '\0'};
char b[3] = {string2[j], string2[j + 1], '\0'};
int cmp = strcmpi(a, b);
if (cmp == 0) {
matches += 2;
}
i++;
j++;
}
return matches / (length1 + length2);
}
一些测试基于原始文章:
#include
void article_test1() {
char *string1 = "FRANCE";
char *string2 = "FRENCH";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void article_test2() {
printf("====%s====\n", __func__);
char *string = "Healed";
char *ss[] = {"Heard", "Healthy", "Help",
"Herded", "Sealed", "Sold"};
int correct[] = {44, 55, 25, 40, 80, 0};
for (int i = 0; i < 6; ++i) {
printf("%2.f%% == %d%%\n", dice_match(string, ss[i]) * 100, correct[i]);
}
}
void multicase_test() {
char *string1 = "FRaNcE";
char *string2 = "fREnCh";
printf("====%s====\n", __func__);
printf("%2.f%% == 40%%\n", dice_match(string1, string2) * 100);
}
void gg_test() {
char *string1 = "GG";
char *string2 = "GGGGG";
printf("====%s====\n", __func__);
printf("%2.f%% != 100%%\n", dice_match(string1, string2) * 100);
}
int main() {
article_test1();
article_test2();
multicase_test();
gg_test();
return 0;
}
我的JavaScript实现采用字符串或字符串数组,以及可选的楼层(默认楼层为0.5).如果您传递一个字符串,它将返回true或false,具体取决于字符串的相似性得分是否大于或等于发言权.如果传递一个字符串数组,它将返回一个数组,这些字符串的相似性得分大于或等于发言权,按得分排序.
例子:
'Healed'.fuzzy('Sealed'); // returns true
'Healed'.fuzzy('Help'); // returns false
'Healed'.fuzzy('Help', 0.25); // returns true
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy']);
// returns ["Sealed", "Healthy"]
'Healed'.fuzzy(['Sold', 'Herded', 'Heard', 'Help', 'Sealed', 'Healthy'], 0);
// returns ["Sealed", "Healthy", "Heard", "Herded", "Help", "Sold"]
这里是:
(function(){
var default_floor = 0.5;
function pairs(str){
var pairs = []
, length = str.length - 1
, pair;
str = str.toLowerCase();
for(var i = 0; i < length; i++){
pair = str.substr(i, 2);
if(!/\s/.test(pair)){
pairs.push(pair);
}
}
return pairs;
}
function similarity(pairs1, pairs2){
var union = pairs1.length + pairs2.length
, hits = 0;
for(var i = 0; i < pairs1.length; i++){
for(var j = 0; j < pairs1.length; j++){
if(pairs1[i] == pairs2[j]){
pairs2.splice(j--, 1);
hits++;
break;
}
}
}
return 2*hits/union || 0;
}
String.prototype.fuzzy = function(strings, floor){
var str1 = this
, pairs1 = pairs(this);
floor = typeof floor == 'number' ? floor : default_floor;
if(typeof(strings) == 'string'){
return str1.length > 1 && strings.length > 1 && similarity(pairs1, pairs(strings)) >= floor || str1.toLowerCase() == strings.toLowerCase();
}else if(strings instanceof Array){
var scores = {};
strings.map(function(str2){
scores[str2] = str1.length > 1 ? similarity(pairs1, pairs(str2)) : 1*(str1.toLowerCase() == str2.toLowerCase());
});
return strings.filter(function(str){
return scores[str] >= floor;
}).sort(function(a, b){
return scores[b] - scores[a];
});
}
};
})();
为了您的方便,这里有一个缩小版本:
(function(){function g(a){var b=[],e=a.length-1,d;a=a.toLowerCase();for(var c=0;c=b||e.toLowerCase()==a.toLowerCase();if(a instanceof Array){var c={};a.map(function(a){c[a]=1=b}).sort(function(a,b){return c[b]-c[a]})}}})();
基于Michael La Voie令人敬畏的C#版本,根据使其成为扩展方法的请求,这就是我想出的.这样做的主要好处是您可以按百分比匹配对通用列表进行排序.例如,假设您的对象中有一个名为"City"的字符串字段.用户搜索"Chester"并且您希望按匹配的降序返回结果.例如,你想要切斯特的文字比赛出现在罗切斯特之前.为此,请向对象添加两个新属性:
public string SearchText { get; set; } public double PercentMatch { get { return City.ToUpper().PercentMatchTo(this.SearchText.ToUpper()); } }
然后在每个对象上,将SearchText设置为用户搜索的内容.然后你可以用以下方法轻松地对它进行排序:
zipcodes = zipcodes.OrderByDescending(x => x.PercentMatch);
这是稍作修改,使其成为一种扩展方法:
////// This class implements string comparison algorithm /// based on character pair similarity /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html /// public static double PercentMatchTo(this string str1, string str2) { Listpairs1 = WordLetterPairs(str1.ToUpper()); List pairs2 = WordLetterPairs(str2.ToUpper()); int intersection = 0; int union = pairs1.Count + pairs2.Count; for (int i = 0; i < pairs1.Count; i++) { for (int j = 0; j < pairs2.Count; j++) { if (pairs1[i] == pairs2[j]) { intersection++; pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success break; } } } return (2.0 * intersection) / union; } /// /// Gets all letter pairs for each /// individual word in the string /// /// ///private static List WordLetterPairs(string str) { List AllPairs = new List (); // Tokenize the string and put the tokens/words into an array string[] Words = Regex.Split(str, @"\s"); // For each word for (int w = 0; w < Words.Length; w++) { if (!string.IsNullOrEmpty(Words[w])) { // Find the pairs of characters String[] PairsInWord = LetterPairs(Words[w]); for (int p = 0; p < PairsInWord.Length; p++) { AllPairs.Add(PairsInWord[p]); } } } return AllPairs; } /// /// Generates an array containing every /// two consecutive letters in the input string /// /// ///private static string[] LetterPairs(string str) { int numPairs = str.Length - 1; string[] pairs = new string[numPairs]; for (int i = 0; i < numPairs; i++) { pairs[i] = str.Substring(i, 2); } return pairs; }