根据标题,我试图找到一种方法来以编程方式确定几个字符串之间最长的相似性部分.
例:
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
理想情况下,我会回来file:///home/gms8994/Music/
,因为这是所有3个字符串中最常见的部分.
具体来说,我正在寻找一个Perl解决方案,但任何语言(甚至伪语言)的解决方案都足够了.
评论:是的,仅在开头; 但是有可能在列表中有一些其他条目,这个问题将被忽略.
编辑:对不起,我很抱歉.可惜我监督使用my
变量内部countit(x, q{})
是一个很大的错误.此字符串在Benchmark模块中进行评估,@ str在那里为空.这个解决方案没有我提出的那么快.见下面的更正.我很抱歉.
Perl可以很快:
use strict; use warnings; package LCP; sub LCP { return '' unless @_; return $_[0] if @_ == 1; my $i = 0; my $first = shift; my $min_length = length($first); foreach (@_) { $min_length = length($_) if length($_) < $min_length; } INDEX: foreach my $ch ( split //, $first ) { last INDEX unless $i < $min_length; foreach my $string (@_) { last INDEX if substr($string, $i, 1) ne $ch; } } continue { $i++ } return substr $first, 0, $i; } # Roy's implementation sub LCP2 { return '' unless @_; my $prefix = shift; for (@_) { chop $prefix while (! /^\Q$prefix\E/); } return $prefix; } 1;
测试套件:
#!/usr/bin/env perl use strict; use warnings; Test::LCP->runtests; package Test::LCP; use base 'Test::Class'; use Test::More; use Benchmark qw(:all :hireswallclock); sub test_use : Test(startup => 1) { use_ok('LCP'); } sub test_lcp : Test(6) { is( LCP::LCP(), '', 'Without parameters' ); is( LCP::LCP('abc'), 'abc', 'One parameter' ); is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' ); is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ), 'abcd', 'Some common prefix' ); my @str = map { chomp; $_ } ; is( LCP::LCP(@str), 'file:///home/gms8994/Music/', 'Test data prefix' ); is( LCP::LCP2(@str), 'file:///home/gms8994/Music/', 'Test data prefix by LCP2' ); my $t = countit( 1, sub{LCP::LCP(@str)} ); diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}"); $t = countit( 1, sub{LCP::LCP2(@str)} ); diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}"); } __DATA__ file:///home/gms8994/Music/t.A.T.u./ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/
测试套件结果:
1..7 ok 1 - use LCP; ok 2 - Without parameters ok 3 - One parameter ok 4 - None of common prefix ok 5 - Some common prefix ok 6 - Test data prefix ok 7 - Test data prefix by LCP2 # LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 20766.06/s (n=22635) # LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @ 16746.73/s (n=17919)
这意味着在您的测试用例中使用纯Perl解决方案的substr
速度比Roy的解决方案快约20%,并且一个前缀发现大约需要50us.除非您的数据或性能预期更高,否则没有必要使用XS.
Brett Daniel已经就" 最长公共子串问题 " 的维基百科条目给出的参考是非常好的一般参考(使用伪代码),如上所述.但是,算法可以是指数的.看起来你可能真的想要一个最长公共前缀的算法,这是一个更简单的算法.
这是我用于最长公共前缀(和原始URL的引用)的那个:
use strict; use warnings; sub longest_common_prefix { # longest_common_prefix( $|@ ): returns $ # URLref: http://linux.seindal.dk/2005/09/09/longest-common-prefix-in-perl # find longest common prefix of scalar list my $prefix = shift; for (@_) { chop $prefix while (! /^\Q$prefix\E/); } return $prefix; } my @str = map {chomp; $_} ; print longest_common_prefix(@ARGV), "\n"; __DATA__ file:///home/gms8994/Music/t.A.T.u./ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/
如果您真的想要LCSS实现,请参阅PerlMonks.org上的这些讨论(最长公共子串和最长公共子序列).Tree :: Suffix可能是最好的通用解决方案,据我所知,它可以实现最佳算法.不幸的是,最近的版本被破 但是,Limbic~Region在这篇文章中 PerlMonks引用的讨论中确实存在一个工作子例程(在这里与您的数据一起复制).
#URLref: http://www.perlmonks.org/?node_id=549876 #by Limbic~Region use Algorithm::Loops 'NestedLoops'; use List::Util 'reduce'; use strict; use warnings; sub LCS{ my @str = @_; my @pos; for my $i (0 .. $#str) { my $line = $str[$i]; for (0 .. length($line) - 1) { my $char= substr($line, $_, 1); push @{$pos[$i]{$char}}, $_; } } my $sh_str = reduce {length($a) < length($b) ? $a : $b} @str; my %map; CHAR: for my $char (split //, $sh_str) { my @loop; for (0 .. $#pos) { next CHAR if ! $pos[$_]{$char}; push @loop, $pos[$_]{$char}; } my $next = NestedLoops([@loop]); while (my @char_map = $next->()) { my $key = join '-', @char_map; $map{$key} = $char; } } my @pile; for my $seq (keys %map) { push @pile, $map{$seq}; for (1 .. 2) { my $dir = $_ % 2 ? 1 : -1; my @offset = split /-/, $seq; $_ += $dir for @offset; my $next = join '-', @offset; while (exists $map{$next}) { $pile[-1] = $dir > 0 ? $pile[-1] . $map{$next} : $map{$next} . $pile[-1]; $_ += $dir for @offset; $next = join '-', @offset; } } } return reduce {length($a) > length($b) ? $a : $b} @pile; } my @str = map {chomp; $_} ; print LCS(@str), "\n"; __DATA__ file:///home/gms8994/Music/t.A.T.u./ file:///home/gms8994/Music/nina%20sky/ file:///home/gms8994/Music/A%20Perfect%20Circle/