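I'm still working with this huge list of URLs, and all the help I've received has been great.
At the moment my list looks like this (but with 17,000 URLs):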
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=3
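I can filter out the duplicates without a problem; there are several ways to do that (awk, etc.). What I'd really like to do is remove the duplicate URLs while also counting how many times each URL occurs in the list, and print that count next to the URL with a pipe separator. After processing, the list should look like this:
url | count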
http://www.domain.com/page?CONTENT_ITEM_ID=1 | 2
http://www.domain.com/page?CONTENT_ITEM_ID=2 | 2
http://www.domain.com/page?CONTENT_ITEM_ID=3 | 3
What would be the fastest way to achieve this?
Cheers
This is probably as fast as you can get without writing code.
$ cat foo.txt
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=1
http://www.domain.com/page?CONTENT_ITEM_ID=2
http://www.domain.com/page?CONTENT_ITEM_ID=3
http://www.domain.com/page?CONTENT_ITEM_ID=3
$ sort foo.txt | uniq -c
      2 http://www.domain.com/page?CONTENT_ITEM_ID=1
      2 http://www.domain.com/page?CONTENT_ITEM_ID=2
      3 http://www.domain.com/page?CONTENT_ITEM_ID=3
I did some testing, and it's not particularly fast, although for 17k lines it takes only about 1 second (on a loaded P4 2.8GHz machine):
$ wc -l foo.txt
174955 foo.txt
vinko@mithril:~/i3media/2008/product/Pending$ time sort foo.txt | uniq -c
  54482 http://www.domain.com/page?CONTENT_ITEM_ID=1
  48212 http://www.domain.com/page?CONTENT_ITEM_ID=2
  72261 http://www.domain.com/page?CONTENT_ITEM_ID=3
real 0m23.534s
user 0m16.817s
sys 0m0.084s
$ wc -l foo.txt
14955 foo.txt
$ time sort foo.txt | uniq -c
   4233 http://www.domain.com/page?CONTENT_ITEM_ID=1
   4290 http://www.domain.com/page?CONTENT_ITEM_ID=2
   6432 http://www.domain.com/page?CONTENT_ITEM_ID=3
real 0m1.349s
user 0m1.216s
sys 0m0.012s
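As usual, though, the better O() wins the race: sorting costs O(n log n), while counting in a hash table is a single pass over the input. I tested S.Lott's solution: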
$ cat pythoncount.py
from collections import defaultdict
myFile = open( "foo.txt", "ru" )
fq= defaultdict( int )
for n in myFile:
    fq[n] += 1
for n in fq.items():
    print "%s|%s" % (n[0].strip(),n[1])
$ wc -l foo.txt
14955 foo.txt
$ time python pythoncount.py
http://www.domain.com/page?CONTENT_ITEM_ID=2|4290
http://www.domain.com/page?CONTENT_ITEM_ID=1|4233
http://www.domain.com/page?CONTENT_ITEM_ID=3|6432
real 0m0.072s
user 0m0.028s
sys 0m0.012s
$ wc -l foo.txt
1778955 foo.txt
$ time python pythoncount.py
http://www.domain.com/page?CONTENT_ITEM_ID=2|504762
http://www.domain.com/page?CONTENT_ITEM_ID=1|517557
http://www.domain.com/page?CONTENT_ITEM_ID=3|756636
real 0m2.718s
user 0m2.440s
sys 0m0.072s
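The script above uses the Python 2 print statement and prints the URLs in arbitrary order. For anyone on Python 3, the same single-pass counting idea might look roughly like the sketch below. It was not part of the timings shown, and the names `count_urls`, `format_counts`, and `sample` are made up for illustration. It assumes one URL per line, skips blank lines, and sorts the rows by URL so the result matches the `url | count` layout from the question.

```python
from collections import Counter


def count_urls(lines):
    """Count each non-empty line, ignoring surrounding whitespace."""
    stripped = (line.strip() for line in lines)
    return Counter(url for url in stripped if url)


def format_counts(counts):
    """Return 'url | count' rows sorted by URL, preceded by a header row."""
    rows = ["url | count"]
    rows.extend(f"{url} | {n}" for url, n in sorted(counts.items()))
    return rows


# Illustrative input: the seven URLs from the question, in the same order.
sample = [
    "http://www.domain.com/page?CONTENT_ITEM_ID=1\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=3\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=2\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=1\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=2\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=3\n",
    "http://www.domain.com/page?CONTENT_ITEM_ID=3\n",
]

for row in format_counts(count_urls(sample)):
    print(row)
```

To run it against a real file, pass an open file object instead of the list, for example `count_urls(open("foo.txt"))`, since iterating a file yields its lines. The final sort only touches the distinct URLs, so it stays cheap when the list is mostly duplicates.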