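In Java

Time: 2015-07-15 20:47:15

Tags: java string integer type-conversion

In Java, what is the fastest way to convert a substring to an integer without using Integer.parseInt? I would like to avoid parseInt because it requires me to create a temporary string that is a copy of the substring I want to convert.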

"abcd12345abcd"  <-- just want chars 4..8 converted.

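I want to avoid creating a new temporary string, so I do not want to call substring.

If I roll my own, is there a way to avoid the overhead of the array bounds check I see inside String.charAt(int)?

Edit

I got a lot of good information from everyone... along with the usual warnings about premature optimization :) The basic answer is that there is nothing better than String.charAt or char[]. Unsafe code is on its way out (maybe). The compiler can probably optimize away the redundant range checks on [].

I did some benchmarking, and the savings from skipping substring and rolling a purpose-built parseInt are huge.

32% of the cost of calling Integer.parseInt(str.substring(4,8)) comes from the substring. That does not include the subsequent garbage collection cost.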

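Integer.parseInt is designed to handle a very wide range of inputs. By rolling my own parseInt with charAt (specific to what our data looks like), I got roughly a 6x speedup over the substring approach.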
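As a rough illustration of what such a hand-rolled parser can look like, here is a minimal sketch. The class and method names are hypothetical, and unlike the Atoi variants in the benchmark further down it rejects non-digit characters instead of skipping them, so it is not the exact code that was measured.

```java
final class SubstringInts {
    private SubstringInts() {}

    /**
     * Parses the ASCII decimal digits in s[start, end) without allocating a substring.
     * Hypothetical helper: no sign handling, and values beyond int range silently overflow.
     */
    static int parseDigits(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end) {
            throw new IndexOutOfBoundsException("bad range " + start + ".." + end);
        }
        int n = 0;
        for (int i = start; i < end; i++) {
            int d = s.charAt(i) - '0';
            if (d < 0 || d > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            n = n * 10 + d;
        }
        return n;
    }

    public static void main(String[] args) {
        // Characters 4..8 of the sample string, end index exclusive.
        System.out.println(parseDigits("abcd12345abcd", 4, 9)); // prints 12345
    }
}
```

Taking a CharSequence rather than a String means the same helper also works on a StringBuilder or CharBuffer, though that generality may cost something on a hot path.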

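Following the comment that suggested char[], I got a performance gain of about 7x. However, your data must already be in a char[], because converting to a char array is expensive. For parsing text, it seems best to stay entirely in char[] and write a few functions for comparing strings.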
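To make the "stay in char[]" idea concrete, here is a small sketch with two hypothetical helpers: one compares a region of the buffer against an expected token, the other parses a fixed-width digit field. The names and the strict digit check are my assumptions, not the poster's code.

```java
final class CharBufferParsing {
    private CharBufferParsing() {}

    /** Returns true if buf[offset, offset + expected.length()) equals expected. */
    static boolean regionEquals(char[] buf, int offset, String expected) {
        int len = expected.length();
        if (offset < 0 || offset + len > buf.length) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (buf[offset + i] != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Parses the ASCII digits in buf[offset, offset + length); rejects anything else. */
    static int parseDigits(char[] buf, int offset, int length) {
        int n = 0;
        for (int i = offset; i < offset + length; i++) {
            int d = buf[i] - '0';
            if (d < 0 || d > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            n = n * 10 + d;
        }
        return n;
    }

    public static void main(String[] args) {
        // The conversion is done once here only for the demo; the point of the
        // approach is that the data should already arrive as a char[].
        char[] record = "abcd12345abcd".toCharArray();
        if (regionEquals(record, 0, "abcd")) {
            System.out.println(parseDigits(record, 4, 5)); // prints 12345
        }
    }
}
```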

Benchmark results (smaller is faster):

parseInt(substring)  23731665
parseInt(string)     16859226
Atoi1                 7116633
Atoi2                 4514031
Atoi3 char[]          4135355
Atoi4 char[]          3503638
Atoi5 char[]          5485495
GetNumber1            8666020
GetNumber2            5951939
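For anyone who wants to turn the table into ratios, here is a small sketch that does the arithmetic on the numbers exactly as listed. The constant names are mine, and the table does not state units, but ratios do not depend on them. On these particular figures the substring share works out to roughly 29% and the speedups to roughly 5.3x (Atoi2) and 6.8x (Atoi4), close to but not identical with the rounded figures quoted in the text.

```java
final class BenchmarkRatios {
    private BenchmarkRatios() {}

    // Values copied from the results table above.
    static final long PARSE_INT_SUBSTRING = 23731665L;
    static final long PARSE_INT_STRING = 16859226L;
    static final long ATOI2_CHAR_AT = 4514031L;
    static final long ATOI4_CHAR_ARRAY = 3503638L;

    /** How many times faster the candidate is than the baseline. */
    static double speedup(long baseline, long candidate) {
        return (double) baseline / candidate;
    }

    /** Fraction of the parseInt(substring) time not explained by parseInt alone. */
    static double substringShare() {
        return (double) (PARSE_INT_SUBSTRING - PARSE_INT_STRING) / PARSE_INT_SUBSTRING;
    }

    public static void main(String[] args) {
        System.out.printf("substring share: %.1f%%%n", 100 * substringShare());
        System.out.printf("Atoi2 speedup:   %.2fx%n", speedup(PARSE_INT_SUBSTRING, ATOI2_CHAR_AT));
        System.out.printf("Atoi4 speedup:   %.2fx%n", speedup(PARSE_INT_SUBSTRING, ATOI4_CHAR_ARRAY));
    }
}
```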

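During benchmarking I also tried turning inlining on and off, and verified that the compiler was inlining everything correctly.

If anyone cares, here is my benchmark code...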

package javaatoi;

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class JavaAtoi {

    static int cPasses = 10;
    static int cTests = 9;
    static int cIter = 0x100000;
    static int cString = 0x100;
    static int fStringMask = cString - 1;

    public static void main(String[] args) throws InterruptedException {

        // Set up test data.  Use a large enough set that the compiler
        // won't unroll the loop.  Use a small enough set that we are
        // keeping the data in L2.  I don't want to measure memory loads.

        String[] a = new String[cString];
        for (int i = 0 ; i< cString ; i+=4) {
            // leading zeros do occur in the data, so include one entry with a leading zero.
            a[i+0] = "abcd01234abcd";
            a[i+1] = "abcd1234abcd";
            a[i+2] = "abcd1234abcd";
            a[i+3] = "abcd1234abcd";
        }

        // array of pre-substringed stuff
        String[] a1 = new String[cString];
        for (int i=0 ; i< cString ; ++i)
            a1[i]= a[i].substring(4,8);

        // char array version of the strings
        char[][] b = new char[cString][];
        for (int i =0 ; i<cString ; ++i)
            b[i] = a[i].toCharArray();

        // array to hold times for each test for each pass
        long[][] t = new long[cPasses][cTests];

        // multiple dry runs to let the compiler optimize the functions
        for (int i=0 ; i<50 ; ++i) {
          t[0][0] = TestParseInt1(a)[0];
          t[0][1] = TestParseInt2(a1)[0];
          t[0][2] = TestAtoi1(a)[0];
          t[0][3] = TestAtoi2(a)[0];
          t[0][4] = TestAtoi3(b)[0];
          t[0][5] = TestAtoi4(b)[0];
          t[0][6] = TestAtoi5(b)[0];
          t[0][7] = TestAtoi6(a)[0];
          t[0][8] = TestAtoi7(a)[0];
        }

        // now do a bunch of tests
        for (int i=0 ; i<cPasses ; ++i) {
            t[i][0] = TestParseInt1(a)[0];
            t[i][1] = TestParseInt2(a1)[0];
            t[i][2] = TestAtoi1(a)[0];
            t[i][3] = TestAtoi2(a)[0];
            t[i][4] = TestAtoi3(b)[0];
            t[i][5] = TestAtoi4(b)[0];
            t[i][6] = TestAtoi5(b)[0];
            t[i][7] = TestAtoi6(a)[0];
            t[i][8] = TestAtoi7(a)[0];
        }

        // setup mins - we only care about min time.
        t[cPasses-1] = new long[cTests];
        for (int i=0 ; i<cTests ; ++i)
            t[cPasses-1][i] = 999999999;
        for (int j=0 ; j<cTests ; ++j) {
            for (int i=0 ; i<cPasses-1 ; ++i) {
                long n = t[i][j];
                if (n < t[cPasses-1][j])
                    t[cPasses-1][j] = n;
            }
        }

        // output string
        String s = new String();
        for (int j=0 ; j<cTests ; ++j) {
            for (int i=0 ; i<cPasses ; ++i) {
                long n = t[i][j];
                s += String.format("%9d", n);
            }
            s += "\n";
        }
        System.out.println(s);

        // If you comment out the loop in TestParseInt1 you can roughly see the
        // GC cost.
        System.gc(); // Trying to get an idea of the total substring cost
        Thread.sleep(1000);  // Not sure this matters; the GC seems to take a little while. Not very exact.

        long collectionTime = 0;
        for (GarbageCollectorMXBean garbageCollectorMXBean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long n = garbageCollectorMXBean.getCollectionTime();
            if (n > 0) 
                collectionTime += n;
        }

        System.out.println(collectionTime*1000000);
    }

   // you have to put each test function in its own wrapper to 
   // get the compiler to fairly optimize each test.
   // I also made sure I incremented n and used a large # of string
   // to make it harder for the compiler to eliminate the loops.

    static long[] TestParseInt1(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        // comment this out to get an idea of gc cost without the substrings
        // then uncomment to get idea of gc cost with substrings
        for (int i=0 ; i<cIter ; ++i) 
            n += Integer.parseInt(a[i&fStringMask].substring(4,8));
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestParseInt2(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Integer.parseInt(a[i&fStringMask]);
        return new long[] { System.nanoTime() - startTime, n };
    }


    static long[] TestAtoi1(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi1(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi2(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi2(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi3(char[][] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi3(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi4(char[][] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi4(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi5(char[][] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi5(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi6(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi6(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static long[] TestAtoi7(String[] a) {
        long n = 0;
        long startTime = System.nanoTime();
        for (int i=0 ; i<cIter ; ++i) 
            n += Atoi7(a[i&fStringMask], 4, 4);
        return new long[] { System.nanoTime() - startTime, n };
    }

    static int Atoi1(String s, int i0, int cb) {
        int n = 0;
        boolean fNeg = false;   // for unsigned T, this assignment is removed by the optimizer
        int i = i0;
        int i1 = i + cb;
        int ch;
        // skip leading crap, scan for -
        for ( ; i<i1 && ((ch = s.charAt(i)) > '9' || ch <= '0') ; ++i) {
            if (ch == '-') 
                fNeg = !fNeg;
        }
        // here is the loop to process the valid number chars.
        for ( ; i<i1 ; ++i) 
            n = n*10 + (s.charAt(i) - '0'); 
        return (fNeg) ? -n : n;
    }

    static int Atoi2(String s, int i0, int cb) {
        int n = 0;
        for (int i=i0 ; i<i0+cb ; ++i) {
            char ch = s.charAt(i);
            n = n*10 + ((ch <= '0') ? 0 : ch - '0');
        }
        return n;
    }

    static int Atoi3(char[] s, int i0, int cb) {
        int n = 0, i = i0, i1 = i + cb;
        // skip leading spaces or zeros
        for ( ; i<i1 && s[i] <= '0' ; ++i) { }
        // loop to process the valid number chars.
        for ( ; i<i1 ; ++i) 
            n = n*10 + (s[i] - '0');
        return n;
    }   

    static int Atoi4(char[] s, int i0, int cb) {
        int n = 0;
        // loop to process the valid number chars.
        for (int i=i0 ; i<i0+cb ; ++i) {
            char ch = s[i];
            n = n*10 + ((ch <= '0') ? 0 : ch - '0');
        }
        return n;
    }   

    static int Atoi5(char[] s, int i0, int cb) {
        int ch, n = 0, i = i0, i1 = i + cb;
        // skip leading crap or zeros
        for ( ; i<i1 && ((ch = s[i]) <= '0' || ch > '9') ; ++i) { }
        // loop to process the valid number chars.
        for ( ; i<i1 && (ch = s[i] - '0') >= 0 && ch <= 9 ; ++i) 
            n = n*10 + ch;
        return n;
    }   

    static int Atoi6(String data, int start, int length) {
        int number = 0;
        // '<' rather than '<=': the range is [start, start + length), like the other Atoi variants
        for (int i = start; i < start + length; i++) {
            if (Character.isDigit(data.charAt(i))) {
                number = (number * 10) + (data.charAt(i) - 48);
            }
        }       
        return number;
    }

    static int Atoi7(String data, int start, int length) {
        int number = 0;
        // '<' rather than '<=': the range is [start, start + length), like the other Atoi variants
        for (int i = start; i < start + length; i++) {
            char ch = data.charAt(i);
            if (ch >= '0' && ch <= '9') {
                number = (number * 10) + (ch - 48);
            }
        }       
        return number;
    }

}

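4 Answers:

Answer 0 (score: 2)

Sorry... you really can't do what you want without one of the following:

  • Creating an intermediate String
  • Creating some other intermediate object in place of the String and then parsing that into an int

Java isn't like C++; a String isn't the same as a char[].

As I mentioned before, every operation on a String that returns a String produces a new String instance, so you inevitably end up dealing with a String in an intermediate state.

The main point here is that if you actually know the substring bounds, use them to get done what you need.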
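One way to use known bounds directly, assuming a Java 9 or later runtime (which postdates this question), is the Integer.parseInt overload that takes a CharSequence plus begin and end indexes. The wrapper class and method below are hypothetical, and whether this beats a hand-rolled loop is not something measured here.

```java
final class RangeParseInt {
    private RangeParseInt() {}

    /** Parses s[begin, end) as a base-10 int without creating a substring (Java 9+). */
    static int parseRange(CharSequence s, int begin, int end) {
        return Integer.parseInt(s, begin, end, 10);
    }

    public static void main(String[] args) {
        System.out.println(parseRange("abcd12345abcd", 4, 9)); // prints 12345
    }
}
```

Like the single-argument parseInt, this overload throws NumberFormatException when the range contains anything other than an optional sign followed by digits.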

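Do not worry about optimization until you can demonstrate that this part of the code is the biggest bottleneck. Even then, stick to optimizations that make sense; in Java 8 you could convert the whole String to an IntStream and parse only the elements that are actually digits.

Chances are this code will not be a major performance hit, and optimizing it prematurely will lead you down a very, very painful path.

Realistically, the closest you can get (using Java 8 and the Stream API) is to do some conversions between Character and String, but this still produces intermediate Strings: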

System.out.println(Integer.parseInt("abcd12345abcd".chars()
                                                   .filter(Character::isDigit)
                                                   .mapToObj(c -> (char) c)
                                                   .map(Object::toString)
                                                   .reduce("", String::concat)));

... which is harder to read and understand than:

System.out.println(Integer.parseInt("abcd12345abcd".substring(4, 9)));
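For comparison, the stream version does not strictly need the intermediate Strings: the digits can be folded into an int directly. This is a sketch under my own assumptions (hypothetical class name, ASCII digits only, every digit in the string is used), not a claim about performance.

```java
final class StreamDigits {
    private StreamDigits() {}

    /**
     * Folds every ASCII digit in s into one int, left to right.
     * The accumulator is not associative, so this must stay a sequential stream.
     */
    static int digitsToInt(String s) {
        return s.chars()
                .filter(c -> c >= '0' && c <= '9')
                .reduce(0, (acc, c) -> acc * 10 + (c - '0'));
    }

    public static void main(String[] args) {
        System.out.println(digitsToInt("abcd12345abcd")); // prints 12345
    }
}
```

The filter uses an explicit '0'..'9' range instead of Character::isDigit because isDigit also accepts non-ASCII digits, for which subtracting '0' would give the wrong value.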

Answer 1 (score: 1)

Update

Seeing that you want to mimic C/C++ behavior in Java, I did some googling and came across http://ssw.jku.at/Research/Papers/Wuerthinger07/ which may interest you.

  

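Array Bounds Check Elimination for the Java HotSpot™ Client Compiler

Abstract

Whenever an array element is accessed, Java virtual machines execute a compare instruction to ensure that the index value is within the valid bounds. This reduces the execution speed of Java programs. Array bounds check elimination identifies situations in which such checks are redundant and can be removed. We present an array bounds check elimination algorithm for the Java HotSpot™ VM based on static analysis in the just-in-time compiler.

The algorithm works on an intermediate representation in static single assignment form and maintains conditions for index expressions. It fully removes bounds checks if it can be proven that they never fail. Whenever possible, it moves bounds checks out of loops. The static number of checks remains the same, but a check inside a loop is likely to be executed more often. If such a check fails, the executing program falls back to interpreted mode, avoiding the problem that an exception is thrown at the wrong place.

The evaluation shows a speedup near the theoretical maximum for the scientific SciMark benchmark suite (40% on average). The algorithm also improves the execution speed of the SPECjvm98 benchmark suite (2% on average, 12% maximum).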

The full research paper can be found here: http://www.ssw.uni-linz.ac.at/Research/Papers/Wuerthinger07/Wuerthinger07.pdf
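In practice this suggests writing the loop in a shape the JIT can reason about: validate the range once, then index with a simple counted loop. The sketch below does that with Objects.checkFromIndexSize (Java 9+). The class and method names are hypothetical, and whether a given JVM actually removes the per-access checks depends on its version and compiler, so treat this as an assumption to verify by measurement.

```java
import java.util.Objects;

final class RangeCheckedParse {
    private RangeCheckedParse() {}

    /**
     * Parses buf[start, start + length) as decimal digits.
     * The range is validated once up front; the loop then uses a plain
     * counted index, which is the pattern bounds check elimination targets.
     * Non-digit characters are not rejected in this sketch.
     */
    static int parseChecked(char[] buf, int start, int length) {
        // Throws IndexOutOfBoundsException if the range does not fit in buf.
        Objects.checkFromIndexSize(start, length, buf.length);
        int end = start + length;
        int n = 0;
        for (int i = start; i < end; i++) {
            n = n * 10 + (buf[i] - '0');
        }
        return n;
    }

    public static void main(String[] args) {
        char[] buf = "abcd12345abcd".toCharArray();
        System.out.println(parseChecked(buf, 4, 5)); // prints 12345
    }
}
```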

OLD ANSWER 2

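Since you know the start and length of the number within the string, you can still "roll your own" without bounds checking. Either way you will have to do some extraction to get the number, whether you extract to a temporary string and then convert it, or convert the characters on the fly.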

public static void main(String[] args) throws Exception {
    String data = "abcd12345abcd";
    System.out.println(getNumber(data, 4, 5));
}

public static int getNumber(String data, int start, int length)
{
    int number = 0;
    // '<' rather than '<=': with '<=' the loop reads one character past the requested range
    for (int i = start; i < start + length; i++) {
        char c = data.charAt(i);
        if ('0' <= c && c <= '9') {
            number = (number * 10) + (c - 48);
        }
    }
    return number;
}

Result:

12345

OLD ANSWER 1

Use String.replaceAll() to strip out what you don't need, then convert/parse what is left.

public static void main(String[] args) throws Exception {
    String data = "abcd12345abcd";

    int myInt = Integer.valueOf(data.replaceAll("[^0-9]", ""));
    System.out.println(myInt);
}

Result:

12345
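If the regex route is used in a loop, note that String.replaceAll compiles its pattern on every call. A minimal sketch of the same idea with a precompiled Pattern follows; the class and method names are hypothetical. It still allocates an intermediate String, so it does not address the original concern about temporaries.

```java
import java.util.regex.Pattern;

final class RegexDigits {
    private RegexDigits() {}

    // Compiled once and reused across calls.
    private static final Pattern NON_DIGITS = Pattern.compile("[^0-9]");

    /** Removes every non-digit and parses the rest; throws NumberFormatException if nothing is left. */
    static int digitsOnly(String data) {
        return Integer.parseInt(NON_DIGITS.matcher(data).replaceAll(""));
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("abcd12345abcd")); // prints 12345
    }
}
```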

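Answer 2 (score: 0)

Keep in mind that this is not how I would normally approach this problem (I would opt for a regex to filter out the non-digits). However, the solution below does not create a separate String (apart from the character array).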

public static int getIntegerFromString(String s) {
    // multiplier must be initialized, otherwise "multiplier *= 10" fails definite assignment
    int multiplier = 1, result = 0;
    boolean inIntegers = false, beforeInteger = true;
    char[] chars = s.toCharArray();
    char c;

    // Iterate through each character, starting at the end
    for(int i = chars.length - 1; i >= 0; i--) {
        c = chars[i];
        if(Character.isDigit(c)) {

            // The char is a digit, so we either increase the multiplier (if the previous char was also a digit) or prepare our environment
            if(inIntegers) {
                multiplier *= 10;
            }
            else {
                inIntegers = true;
                beforeInteger = false;
                multiplier = 1;
            }

            result += multiplier * Character.getNumericValue(c);
        }
        else if(inIntegers) {
            // We're done with the sequence of integers. Stop the for-loop.
            break;
        }
    }

    return result;
}
[chris@localhost:Projects]$ java Test 3949
3949
[chris@localhost:Projects]$ java Test 3949G
3949
[chris@localhost:Projects]$ java Test E3949G
3949
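The same "last run of digits" behaviour can also be had without the toCharArray() copy by scanning backwards with charAt. This is a sketch under my own assumptions (hypothetical names, ASCII digits only, returns 0 when there are no digits, no overflow handling):

```java
final class LastDigitRun {
    private LastDigitRun() {}

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the value of the last contiguous run of digits in s, or 0 if there is none. */
    static int lastInt(CharSequence s) {
        int end = s.length();
        // Skip trailing non-digits.
        while (end > 0 && !isAsciiDigit(s.charAt(end - 1))) {
            end--;
        }
        // Walk back over the digit run to find where it starts.
        int start = end;
        while (start > 0 && isAsciiDigit(s.charAt(start - 1))) {
            start--;
        }
        int n = 0;
        for (int i = start; i < end; i++) {
            n = n * 10 + (s.charAt(i) - '0');
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(lastInt("E3949G")); // prints 3949
    }
}
```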

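Answer 3 (score: -2)

You could try looking at sun.misc.Unsafe. I have never actually used it, but if you want to avoid bounds checks and the like, this (undocumented) class can be used to do that.

See https://stackoverflow.com/questions/5574241/how-can-sun-misc-unsafe-be-used-in-the-real-world

Edit: On the removal of Unsafe in Java 9 (the author argues that removing it is a bad idea because many libraries use it): http://blog.dripstat.com/removal-of-sun-misc-unsafe-a-disaster-in-the-making/

JNI could also be used, but I would guess that calling it like an ordinary method incurs a lot of overhead (if you already count bounds checks as overhead).

See: What makes JNI calls slow?

The following link may also be interesting; its author also says that methods which are called often but run only briefly are hard to optimize: https://thinkingandcomputing.com/2014/03/30/eliminating-jni-overhead/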

For details on how to obtain and use Unsafe, see: http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/

Example of an unsafe array (unsafe is assumed to be an already obtained sun.misc.Unsafe instance):

    int[] x = new int[]{1,2,3,4};
    final int offset = unsafe.arrayBaseOffset(int[].class);
    final int arrayIndexScale = unsafe.arrayIndexScale(int[].class);
    for (int i=0;i<4;i++){
        unsafe.putInt(x, offset+arrayIndexScale*i, 11*(i+1));
    }
    System.out.println(Arrays.toString(x));

Output: [11, 22, 33, 44]