I have an array of about 1000 entries; below are some examples:

wickedweather
liquidweather
driveourtrucks
gocompact
slimprojector

I would like to be able to split these into their respective words, as in:

wicked weather
liquid weather
drive our trucks
go compact
slim projector

I was hoping a regular expression could do the trick, but since there is no boundary to stop on and no capitalization that I could key on, I am thinking some sort of dictionary reference might be necessary?

I suppose it could be done by hand, but why, when it can be done with code! =) This has me stumped, though. Any ideas?
The Viterbi algorithm is much faster. It computes the same scores as the recursive search in Dmitry's answer above, but in O(n) time. (Dmitry's search takes exponential time; Viterbi does it by dynamic programming.)
import re
from collections import Counter

def viterbi_segment(text):
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word): return dictionary[word] / total
def words(text): return re.findall('[a-z]+', text.lower())
dictionary = Counter(words(open('big.txt').read()))
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))
Testing it:
>>> viterbi_segment('wickedweather')
(['wicked', 'weather'], 5.1518198982768158e-10)
>>> ' '.join(viterbi_segment('itseasyformetosplitlongruntogetherblocks')[0])
'its easy for me to split long run together blocks'
For practical use you will probably want a couple of refinements:
Add logs of probabilities instead of multiplying probabilities. This avoids floating-point underflow.
Your inputs will, in general, use words not in your corpus. These substrings must be assigned a nonzero probability as words, or you end up with no solution or a bad one. (That is just as true for the exponential search algorithm above.) This probability has to be siphoned off from the corpus words' probabilities and distributed plausibly among all the other candidate words; the general topic is known in statistical language modeling as smoothing. (You can get away with some pretty rough hacks, though.) This is where the O(n) Viterbi algorithm blows away the search algorithm, because considering non-corpus words would blow up the branching factor. (A rough sketch of both refinements follows this list.)
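Here is a rough sketch of what those two refinements could look like, bolted onto the viterbi_segment code above. This is my own addition, not part of the original answer; the unknown-word penalty is just a crude length-based hack standing in for real smoothing, and the names log_word_prob and viterbi_segment_log are mine.

import math

def log_word_prob(word):
    # Known words: log of relative frequency in the corpus.
    count = dictionary.get(word, 0)
    if count:
        return math.log(count / total)
    # Unknown words: a tiny, length-penalized floor so a parse always exists
    # (a crude stand-in for proper smoothing).
    return math.log(10.0 / (total * 10 ** len(word)))

def viterbi_segment_log(text):
    # Same dynamic program as viterbi_segment above, but summing log
    # probabilities instead of multiplying, so long inputs don't underflow.
    best, lasts = [0.0], [0]
    for i in range(1, len(text) + 1):
        score_k, k = max((best[j] + log_word_prob(text[j:i]), j)
                         for j in range(max(0, i - max_word_length), i))
        best.append(score_k)
        lasts.append(k)
    words = []
    i = len(text)
    while i > 0:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, best[-1]

With this, an input containing a substring that never occurs in big.txt still gets a parse with a finite score, instead of every candidate collapsing to probability 0.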
Can a human even do it?
farsidebag
far sidebag
farside bag
far side bag
Not only do you have to use a dictionary, you may have to use a statistical approach to figure out what is most likely (or, god forbid, an actual HMM for your human language of choice...).
For statistics that might be helpful here, I turn to Dr. Peter Norvig, who addresses a different but related problem of spell checking in 21 lines of code: http://norvig.com/spell-correct.html
(He does cheat a bit by folding every for loop into a single line... but still.)
Update: This kept nagging at me, so I had to give birth to it today. This code does a similar split to the one described by Robert Gamble, but then it orders the results based on word frequency in the provided dictionary file (which is expected to be some text representative of your domain or of English in general; I used Norvig's big.txt, linked above, and concatenated a dictionary file onto it to cover missing words).
A combination of two words will, most of the time, beat a combination of three words, unless the frequency difference is enormous.
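To make that concrete, here is a toy calculation of my own, with made-up frequencies, of the kind of joint score the code below computes: each word contributes a relative-frequency factor well below 1, so every extra word in a parse multiplies in another small penalty.

# Hypothetical counts out of a 1,000,000-token corpus (illustration only).
freq = {'wicked': 100, 'weather': 200, 'we': 5000, 'at': 8000, 'her': 3000}
total = 1000000.0

def seq_score(words):
    # Naive joint probability: product of independent word probabilities.
    score = 1.0
    for w in words:
        score *= freq[w] / total
    return score

print(seq_score(['wicked', 'weather']))          # 2e-08
print(seq_score(['wicked', 'we', 'at', 'her']))  # ~1.2e-11: loses despite very common words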
I posted this code, with some minor changes, on my blog: http://squarecog.wordpress.com/2008/10/19/splitting-words-joined-into-a-single-string/ and also wrote a little about the underflow bug in this code. I was tempted to just quietly fix it, but figured this may help some folks who have not seen the log trick before: http://squarecog.wordpress.com/2009/01/10/dealing-with-underflow-in-joint-probability-calculations/
Output on your words, plus a few of my own; notice what happens with "orcore":
perl splitwords.pl big.txt words

answerveal: 2 possibilities
 - answer veal
 - answer ve al

wickedweather: 4 possibilities
 - wicked weather
 - wicked we at her
 - wick ed weather
 - wick ed we at her

liquidweather: 6 possibilities
 - liquid weather
 - liquid we at her
 - li quid weather
 - li quid we at her
 - li qu id weather
 - li qu id we at her

driveourtrucks: 1 possibilities
 - drive our trucks

gocompact: 1 possibilities
 - go compact

slimprojector: 2 possibilities
 - slim projector
 - slim project or

orcore: 3 possibilities
 - or core
 - or co re
 - orc ore
Code:
#!/usr/bin/env perl

use strict;
use warnings;

sub find_matches($);
sub find_matches_rec($\@\@);
sub find_word_seq_score(@);
sub get_word_stats($);
sub print_results($@);
sub Usage();

our(%DICT, $TOTAL);
{
  my( $dict_file, $word_file ) = @ARGV;
  ($dict_file && $word_file) or die(Usage);

  {
    my $DICT;
    ($DICT, $TOTAL) = get_word_stats($dict_file);
    %DICT = %$DICT;
  }

  {
    open( my $WORDS, '<', $word_file ) or die "unable to open $word_file\n";

    foreach my $word (<$WORDS>) {
      chomp $word;
      my $arr = find_matches($word);

      local $_;
      # Schwartzian Transform
      my @sorted_arr =
        map  { $_->[0] }
        sort { $b->[1] <=> $a->[1] }
        map  { [ $_, find_word_seq_score(@$_) ] }
        @$arr;

      print_results( $word, @sorted_arr );
    }

    close $WORDS;
  }
}

sub find_matches($){
    my( $string ) = @_;

    my @found_parses;
    my @words;
    find_matches_rec( $string, @words, @found_parses );

    return  @found_parses if wantarray;
    return \@found_parses;
}

sub find_matches_rec($\@\@){
    my( $string, $words_sofar, $found_parses ) = @_;
    my $length = length $string;

    unless( $length ){
      push @$found_parses, $words_sofar;

      return @$found_parses if wantarray;
      return  $found_parses;
    }

    foreach my $i ( 2..$length ){
      my $prefix = substr($string, 0, $i);
      my $suffix = substr($string, $i, $length - $i);

      if( exists $DICT{$prefix} ){
        my @words = ( @$words_sofar, $prefix );
        find_matches_rec( $suffix, @words, @$found_parses );
      }
    }

    return @$found_parses if wantarray;
    return  $found_parses;
}

## Just a simple joint probability
## assumes independence between words, which is obviously untrue
## that's why this is broken out -- feel free to add better brains
sub find_word_seq_score(@){
    my( @words ) = @_;
    local $_;

    my $score = 1;
    foreach ( @words ){
      $score = $score * $DICT{$_} / $TOTAL;
    }

    return $score;
}

sub get_word_stats($){
    my ($filename) = @_;

    open(my $DICT, '<', $filename) or die "unable to open $filename\n";

    local $/ = undef;
    local $_;
    my %dict;
    my $total = 0;

    while ( <$DICT> ){
      foreach ( split(/\b/, $_) ) {
        $dict{$_} += 1;
        $total++;
      }
    }

    close $DICT;

    return (\%dict, $total);
}

sub print_results($@){
    #( 'word', [qw'test one'], [qw'test two'], ... )
    my ($word, @combos) = @_;
    local $_;
    my $possible = scalar @combos;

    print "$word: $possible possibilities\n";
    foreach (@combos) {
      print ' - ', join(' ', @$_), "\n";
    }
    print "\n";
}

sub Usage(){
    return "$0 /path/to/dictionary /path/to/your_words";
}
The best tool for the job here is recursion, not regular expressions. The basic idea is to start from the beginning of the string looking for a word, then take the rest of the string and look for another word, and so on until the end of the string is reached. A recursive solution is natural since backtracking is needed when a given remainder of the string cannot be broken into a set of words. The solution below uses a dictionary to determine what counts as a word and prints out solutions as it finds them (some strings can be broken into multiple possible sets of words; for example, wickedweather can also be parsed as "wicked we at her"). If you only want one set of words, you will need to decide on the rules for selecting the best set.
#!/usr/bin/perl

use strict;

my $WORD_FILE = '/usr/share/dict/words'; # Change as needed
my %words;                               # Hash of words in dictionary

# Open dictionary, load words into hash
open(WORDS, $WORD_FILE) or die "Failed to open dictionary: $!\n";
while (<WORDS>) {
  chomp;
  $words{lc($_)} = 1;
}
close(WORDS);

# Read one line at a time from stdin, break into words
while (<>) {
  chomp;
  my @words;
  find_words(lc($_));
}

sub find_words {
  # Print every way $string can be parsed into whole words
  my $string = shift;
  my @words  = @_;
  my $length = length $string;

  foreach my $i ( 1 .. $length ) {
    my $word      = substr $string, 0, $i;
    my $remainder = substr $string, $i, $length - $i;
    # Some dictionaries contain each letter as a word
    next if ($i == 1 && ($word ne "a" && $word ne "i"));

    if (defined($words{$word})) {
      push @words, $word;
      if ($remainder eq "") {
        print join(' ', @words), "\n";
        return;
      } else {
        find_words($remainder, @words);
      }
      pop @words;
    }
  }

  return;
}
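That answer leaves the "best set" rule open. As one illustration (my own sketch, in Python rather than Perl, over hypothetical parse lists), a simple rule is to prefer the parse with the fewest words and break ties toward longer words up front; the frequency-weighted scoring in the answers above is the more principled version of the same idea.

# Collect all parses as lists of words instead of printing them, then pick one.
# Rule: fewest words wins; ties go to the parse with longer words up front.
def best_parse(parses):
    return min(parses, key=lambda p: (len(p), [-len(w) for w in p]))

parses = [['wicked', 'weather'],
          ['wicked', 'we', 'at', 'her'],
          ['wick', 'ed', 'weather']]
print(best_parse(parses))  # ['wicked', 'weather']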