嘿,我正在使用Levenshteins算法来获得源和目标字符串之间的距离.
我也有从0到1返回值的方法:
////// Gets the similarity between two strings. /// All relation scores are in the [0, 1] range, /// which means that if the score gets a maximum value (equal to 1) /// then the two string are absolutely similar /// /// The string1. /// The string2. ///public static float CalculateSimilarity(String s1, String s2) { if ((s1 == null) || (s2 == null)) return 0.0f; float dis = LevenshteinDistance.Compute(s1, s2); float maxLen = s1.Length; if (maxLen < s2.Length) maxLen = s2.Length; if (maxLen == 0.0F) return 1.0F; else return 1.0F - dis / maxLen; }
但这对我来说还不够.因为我需要更复杂的方式来匹配两个句子.
例如,我想自动标记一些音乐,我有原创歌曲名称,我有垃圾的歌曲,如超级,质量,像2007年,2008年等等...也有些文件只有http://垃圾. .thash..song_name_mp3.mp3,其他都很正常.我想创建一个比现在更完美的算法..也许有人可以帮助我吗?
这是我目前的算法:
////// if we need to ignore this target. /// /// The target string. ///private bool doIgnore(String targetString) { if ((targetString != null) && (targetString != String.Empty)) { for (int i = 0; i < ignoreWordsList.Length; ++i) { //* if we found ignore word or target string matching some some special cases like years (Regex). if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true; } } return false; } /// /// Removes the duplicates. /// /// The list. private void removeDuplicates(Listlist) { if ((list != null) && (list.Count > 0)) { for (int i = 0; i < list.Count - 1; ++i) { if (list[i] == list[i + 1]) { list.RemoveAt(i); --i; } } } } /// /// Does the fuzzy match. /// /// The target title. ///private TitleMatchResult doFuzzyMatch(String targetTitle) { TitleMatchResult matchResult = null; if (targetTitle != null && targetTitle != String.Empty) { try { //* change target title (string) to lower case. targetTitle = targetTitle.ToLower(); //* scores, we will select higher score at the end. Dictionary scores = new Dictionary (); //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_' List targetKeywords = new List (targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); //* remove all trash from keywords, like super, quality, etc.. targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); }); //* sort keywords. targetKeywords.Sort(); //* remove some duplicates. removeDuplicates(targetKeywords); //* go through all original titles. foreach (Title sourceTitle in titles) { float tempScore = 0f; //* split orig. title to keywords list. List sourceKeywords = new List (sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries)); sourceKeywords.Sort(); removeDuplicates(sourceKeywords); //* go through all source ttl keywords. foreach (String keyw1 in sourceKeywords) { float max = float.MinValue; foreach (String keyw2 in targetKeywords) { float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2); if (currentScore > max) { max = currentScore; } } tempScore += max; } //* calculate average score. float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); //* if average score is bigger than minimal score and target title is not in this source title ignore list. if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle)) { //* add score. scores.Add(sourceTitle, averageScore); } } //* choose biggest score. float maxi = float.MinValue; foreach (KeyValuePair kvp in scores) { if (kvp.Value > maxi) { maxi = kvp.Value; matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic); } } } catch { } } //* return result. return matchResult; }
这种方法正常,但在某些情况下,很多标题应该匹配,不匹配......我想我需要某种公式来玩权重等等,但我想不出一个......
想法?建议?交易算法?
顺便说一句,我已经知道了这个话题(我的同事已经发布了它,但我们无法为这个问题找到合适的解决方案.): 近似字符串匹配算法
有点旧,但它可能对未来的访客有用.如果您已经在使用Levenshtein算法并且需要更好一些,我将在此解决方案中描述一些非常有效的启发式方法:
获得最接近的字符串匹配
关键是你想出3或4(或更多)方法来衡量你的短语之间的相似性(Levenshtein距离只是一种方法) - 然后使用你想要匹配的字符串的真实例子,你调整权重和那些启发式的组合,直到你获得最大化积极匹配数量的东西.然后你将这个公式用于所有未来的比赛,你应该看到很好的结果.
如果用户参与了该过程,那么最好还提供一个界面,允许用户在不同意第一选择的情况下查看排名相似的其他匹配.
以下是相关答案的摘录.如果您最终想要使用任何此类代码,我会提前道歉,因为必须将VBA转换为C#.
简单,快速,非常有用的指标.使用这个,我创建了两个单独的度量标准来评估两个字符串的相似性.一个我称之为"valuePhrase",一个我称之为"valueWords".valuePhrase只是两个短语之间的Levenshtein距离,valueWords根据分隔符(如空格,破折号和其他任何你想要的东西)将字符串分成单个单词,并将每个单词与其他单词进行比较,总结最短的单词Levenshtein距离连接任何两个单词.从本质上讲,它衡量一个"短语"中的信息是否真的包含在另一个"短语"中,就像单词排列一样.我花了几天作为一个侧面项目,提出了基于分隔符分割字符串的最有效方法.
valueWords,valuePhrase和Split函数:
Public Function valuePhrase#(ByRef S1$, ByRef S2$) valuePhrase = LevenshteinDistance(S1, S2) End Function Public Function valueWords#(ByRef S1$, ByRef S2$) Dim wordsS1$(), wordsS2$() wordsS1 = SplitMultiDelims(S1, " _-") wordsS2 = SplitMultiDelims(S2, " _-") Dim word1%, word2%, thisD#, wordbest# Dim wordsTotal# For word1 = LBound(wordsS1) To UBound(wordsS1) wordbest = Len(S2) For word2 = LBound(wordsS2) To UBound(wordsS2) thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2)) If thisD < wordbest Then wordbest = thisD If thisD = 0 Then GoTo foundbest Next word2 foundbest: wordsTotal = wordsTotal + wordbest Next word1 valueWords = wordsTotal End Function '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' ' SplitMultiDelims ' This function splits Text into an array of substrings, each substring ' delimited by any character in DelimChars. Only a single character ' may be a delimiter between two substrings, but DelimChars may ' contain any number of delimiter characters. It returns a single element ' array containing all of text if DelimChars is empty, or a 1 or greater ' element array if the Text is successfully split into substrings. ' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur. ' If Limit greater than 0, the function will only split Text into 'Limit' ' array elements or less. The last element will contain the rest of Text. '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _ Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _ Optional ByVal Limit As Long = -1) As String() Dim ElemStart As Long, N As Long, M As Long, Elements As Long Dim lDelims As Long, lText As Long Dim Arr() As String lText = Len(Text) lDelims = Len(DelimChars) If lDelims = 0 Or lText = 0 Or Limit = 1 Then ReDim Arr(0 To 0) Arr(0) = Text SplitMultiDelims = Arr Exit Function End If ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit)) Elements = 0: ElemStart = 1 For N = 1 To lText If InStr(DelimChars, Mid(Text, N, 1)) Then Arr(Elements) = Mid(Text, ElemStart, N - ElemStart) If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) > 0 Then Elements = Elements + 1 Else Elements = Elements + 1 End If ElemStart = N + 1 If Elements + 1 = Limit Then Exit For End If Next N 'Get the last token terminated by the end of the string into the array If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart) 'Since the end of string counts as the terminating delimiter, if the last character 'was also a delimiter, we treat the two as consecutive, and so ignore the last elemnent If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1 ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements SplitMultiDelims = Arr End Function
相似度量
使用这两个指标,第三个只是计算两个字符串之间的距离,我有一系列变量,我可以运行一个优化算法来实现最大数量的匹配.模糊字符串匹配本身就是一种模糊科学,因此通过创建线性独立的度量来测量字符串相似性,并且我们希望彼此匹配的已知字符串集合,我们可以找到针对我们特定样式的参数.字符串,给出最佳的模糊匹配结果.
最初,度量标准的目标是为完全匹配提供较低的搜索值,并为越来越多的置换度量增加搜索值.在一个不切实际的情况下,使用一组明确定义的排列很容易定义,并且设计最终公式使得它们具有所需的增加的搜索值结果.
正如您所看到的,最后两个指标(模糊字符串匹配指标)已经自然倾向于为要匹配的字符串(沿着对角线)提供低分数.这是非常好的.
应用 为了优化模糊匹配,我对每个度量进行加权.因此,模糊字符串匹配的每个应用可以不同地加权参数.定义最终得分的公式是指标及其权重的简单组合:
value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight + Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight + lengthWeight*lengthValue
使用优化算法(神经网络在这里是最好的,因为它是一个离散的,多维的问题),现在的目标是最大化匹配的数量.我创建了一个函数,可以检测每个集合之间正确匹配的数量,如最终屏幕截图所示.如果为最低分数分配了要匹配的字符串,则列或行获得一个点;如果最低分数存在平局,则给出部分分数,并且正确匹配在绑定的匹配字符串中.然后我优化了它.您可以看到绿色单元格是与当前行最匹配的列,单元格周围的蓝色方块是与当前列最匹配的行.底角的分数大致是成功匹配的数量,这就是我们告诉我们的优化问题最大化.
听起来你想要的可能是最长的子串匹配.也就是说,在你的例子中,两个文件就像
trash..thash..song_name_mp3.mp3和garbage..spotch..song_name_mp3.mp3
最终会看起来一样.
当然,你需要一些启发式方法.你可能尝试的一件事是将字符串放入soundex转换器.Soundex是用于查看事物"听起来"是否相同的"编解码器"(您可能会告诉电话接线员).它或多或少是一个粗略的语音和错误发音半证明音译.它绝对比编辑距离更差,但更便宜.(官方用于名称,只使用三个字符.但是没有理由停在那里,只需使用字符串中每个字符的映射.有关详细信息,请参阅维基百科)
所以我的建议是将你的字符串索引,将每个字符串分成几个长度的部分(例如5,10,20),然后只看集群.在群集中,您可以使用更昂贵的内容,例如编辑距离或最大子字符串.
您的问题可能是区分噪音词和有用数据:
Rolling_Stones.Best_of_2003.Wild_Horses.mp3
Super.Quality.Wild_Horses.mp3
Tori_Amos.Wild_Horses.mp3
您可能需要生成要忽略的噪音词典.这看起来很笨重,但我不确定是否有一种可以区分乐队/专辑名称和噪音的算法.