Performance of the StringTokenizer class vs. the String.split method in Java

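In my software I need to split strings into words. I currently have more than 19,000,000 documents, each with more than 30 words.

Which of the following two approaches is better in terms of performance?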

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // handle each word here
}

or

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // handle each word here
}

1> Peter Lawrey.. 63

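If your data is already in a database and you need to parse the strings, I would suggest calling indexOf repeatedly. In the benchmark below it is roughly two to three times faster than either of your two solutions.

However, fetching the data from the database is still likely to be far more expensive than the parsing.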

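import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// The wrapper class name is arbitrary; it only makes the snippet compilable.
public class SplitBenchmark {
    public static void main(String[] args) {
        // Build a sample of 60 six-digit "words", each followed by a space.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        // Five rounds, so that later rounds measure warmed-up (JIT-compiled) code.
        for (int i = 0; i < 5; i++) {
            {
                // 1. StringTokenizer with its default delimiters (whitespace).
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / (runs * 1000.0));
            }
            {
                // 2. A precompiled Pattern, which is what String.split uses for general regexes.
                long start = System.nanoTime();
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / (runs * 1000.0));
            }
            {
                // 3. Manual scan with indexOf and substring.
                // Every token in sample is followed by a space, so no tail handling is needed here.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / (runs * 1000.0));
            }
        }
    }
}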

This prints:

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
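
The benchmark's indexOf loop only emits tokens that are followed by a space, which works there because the sample string ends with one. For real documents the last word usually has no trailing space, so a reusable version needs one extra step. The following is a minimal sketch of that idea, not part of the original answer; the class name IndexOfSplitSketch and the method name splitOnSpace are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are illustrative only.
class IndexOfSplitSketch {

    /**
     * Splits s on single space characters using repeated indexOf calls.
     * Consecutive spaces produce empty strings, and a trailing space does
     * not produce a trailing empty string.
     */
    static List<String> splitOnSpace(String s) {
        List<String> words = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(' ', pos)) >= 0) {
            words.add(s.substring(pos, end));
            pos = end + 1;
        }
        // Keep the final word when the input does not end with a space.
        if (pos < s.length()) {
            words.add(s.substring(pos));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitOnSpace("the quick brown fox")); // [the, quick, brown, fox]
    }
}
```

If empty tokens are unwanted, one option is to skip the add call whenever end == pos.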

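Opening a file costs about 8 ms. Because the files are so small, your cache may improve that by a factor of 2 to 5. Even so, you will spend roughly 10 hours just opening files. The cost difference between split and StringTokenizer is far less than 0.01 ms per document. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at roughly 1 GB per 2 seconds).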
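
The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the answer's round figures (8 ms per file open, a cache speedup of 2 to 5, 30 words of 8 letters each, 1 GB parsed per 2 seconds); it additionally assumes one byte per letter, and the class and method names are made up for illustration.

```java
// Hypothetical estimate helper; all constants are the round figures quoted in the answer.
class CostEstimateSketch {
    static final long DOCUMENTS = 19_000_000L;
    static final double OPEN_MS_PER_FILE = 8.0;
    static final int WORDS_PER_DOCUMENT = 30;
    static final int LETTERS_PER_WORD = 8;       // assumed to be one byte each
    static final double SECONDS_PER_GB = 2.0;    // "1 GB per 2 seconds"

    /** Hours spent opening all files, given how much the cache speeds up each open. */
    static double openHours(double cacheSpeedup) {
        double totalMs = DOCUMENTS * OPEN_MS_PER_FILE / cacheSpeedup;
        return totalMs / 1000.0 / 3600.0;
    }

    /** Seconds spent parsing the text of all documents. */
    static double parseSeconds() {
        double bytes = (double) DOCUMENTS * WORDS_PER_DOCUMENT * LETTERS_PER_WORD;
        return bytes / 1e9 * SECONDS_PER_GB;
    }

    public static void main(String[] args) {
        System.out.printf("open, no cache:   %.1f h%n", openHours(1));
        System.out.printf("open, 2x cache:   %.1f h%n", openHours(2));
        System.out.printf("open, 5x cache:   %.1f h%n", openHours(5));
        System.out.printf("parse everything: %.1f s%n", parseSeconds());
    }
}
```

With these inputs, opening the files takes on the order of 8 to 21 hours depending on the cache, while parsing takes about 9 seconds, which is why the choice of tokenizer hardly matters for total run time.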

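If you want to improve performance, I suggest you have far fewer files, for example by using a database. If you don't want to use an SQL database, I suggest one of the options listed at http://nosql-database.org/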



Interesting: I ran your code, and on my machine `split` consistently takes twice as long as `StringTokenizer`. `indexOf` takes half the time.
@Peter Lawrey: StringTokenizer does not use regular expressions.

2> nes1983..:

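In Java 7, split simply calls indexOf for this kind of input (a single-character delimiter); see the source. Split should therefore be very fast, close to repeated calls of indexOf.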



3> developer..:

The Java API specification recommends using split. See the documentation of StringTokenizer.


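The question is quite clear: he is looking for the best way to do this in terms of performance. The API recommends split, but it does not mention that (according to everything else I found via Google) StringTokenizer performs better.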
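
Whichever one is faster on a given machine, the two approaches from the question are not drop-in replacements for each other, because they treat repeated and trailing delimiters differently. The sketch below illustrates this with standard library calls only; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical comparison helper; the names are illustrative only.
class TokenSemanticsSketch {

    /** Collects the tokens produced by StringTokenizer with a space delimiter. */
    static List<String> viaTokenizer(String s) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, " ");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    /** Collects the tokens produced by String.split with a space delimiter. */
    static List<String> viaSplit(String s) {
        return Arrays.asList(s.split(" "));
    }

    public static void main(String[] args) {
        String doubled = "a b  c"; // two spaces between b and c
        // StringTokenizer never returns empty tokens.
        System.out.println(viaTokenizer(doubled).size()); // 3
        // split keeps the empty string between the two adjacent spaces.
        System.out.println(viaSplit(doubled).size());     // 4
        // split drops trailing empty strings, so a trailing space is harmless.
        System.out.println(viaSplit("a b ").size());      // 2
    }
}
```

If the documents can contain runs of spaces, the word counts from the two approaches will differ, so it is worth deciding which behaviour is wanted before comparing timings.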