Question

我遇到与字符串比较相关的性能问题（在Java中）。

我正在开发一个需要对一个巨大的列表进行排序的项目（Eclipse中的TableViewer）。无论如何，我已经将瓶颈指向了compareTo（）的调用，以便比较字符串。

有没有办法优化字符串比较的性能？我搜索并用谷歌搜索无济于事......

由于该项目严格限于Win32环境，我当时认为也许可以利用它......

任何建议都将不胜感激。

编辑：我忘了提到我需要数字比较和字符串的字面比较。

EDIT2：目标主要是加速用户界面，因为每次单击表头以执行排序时等待几秒是不可接受的。我正在考虑以某种方式缓存值来加速比较。由于字符串非常静态，我认为这是可能的。

EDIT3：我知道很多人都被try（） - catch（）所困扰。实际上这不是一个问题，因为即使我删除该代码并只执行catch-block（单个compareTo（）），它仍然以与原始代码几乎相同的速度执行。但是，如果我也注释掉compareTo（）;只留下比较功能的开销（获得标签等），它快速闪电。所以我仍然需要一种比较字符串的更好方法。无论是通过缓存还是通过做其他魔术。

不幸的是，不可能改变排序算法 - 但我怀疑它是那么慢，因为它成功地快速排序纯整数。

澄清：

比较函数是作为TableViewer框架的一部分实现的，用于执行排序操作，这意味着我没有实现特定的排序算法，而是由SWT / JFace实现。我只是实现了比较功能。

更有趣的是，用于排序双精度的代码比字符串比较更快。使用数字而不是实际的文字字符串对列进行排序更快....这使我得出结论，在compareTo（）方法中发生了一些可疑的事情......

这是该功能的核心：

// e1Label and e2Label is Strings to be compared
//

// Be smart about the comparison and use non-lexical comparison if
// possible (i.e. if both strings are actually numbers...)
//
// Warning: This is only "semi-smart" as the sorting might get "a bit"
// messed up if some of the values in a column can be parsed as
// doubles while others can not...
//
try {
    // Try using numeric (double) comparison of label values
    //
    double e1_double = Double.parseDouble(e1Label);
    double e2_double = Double.parseDouble(e2Label);
    rc = Double.compare(e1_double, e2_double);
} catch (NumberFormatException e) {
    // Use lexical comparison if double comparison is not possible
    //
    rc = e1Label.compareToIgnoreCase(e2Label);
}

Answer 1

如果您了解自己的String内容，则可以预先计算并存储其他信息，以加快比较速度。例如，假设您的String仅包含大写字母A-Z。您可以根据前3个字母说明String的排名; e.g。

AAA：= 1
AAB：= 2
...
ABA：= 27

然后，您可以通过首先比较每个compareTo的排名（基于快速整数的比较）来加速String，然后在排名相等的情况下仅执行完整的String比较。

Answer 2

即使瓶颈似乎是compareTo（）函数，它也可能在分析器中脱颖而出，因为它是循环中调用最多的函数。

了解您的排序例程的确切功能可能也是有益的。您可能更好地更改排序算法，因为那里可以获得更快的速度。

Answer 3

几乎可以肯定，这种例外正在减慢比较速度。抛出和捕获异常是一项昂贵的操作，并且每个非数字单元格值都会出现异常。

首先考虑使用正则表达式检查值是否为数字，如果不是，则不要尝试解析它。

private static final Pattern numberPattern = Pattern.compile("[-+0-9.e]+");

// ...

// e1Label and e2Label is Strings to be compared
//

// Be smart about the comparison and use non-lexical comparison if
// possible (i.e. if both strings are actually numbers...)
//
// Warning: This is only "semi-smart" as the sorting might get "a bit"
// messed up if some of the values in a column can be parsed as
// doubles while others can not...
//
if (numberPattern.matches(e1Label) && numberPattern.matches(e2Label)) {
    try {
        // Try using numeric (double) comparison of label values
        //
        double e1_double = Double.parseDouble(e1Label);
        double e2_double = Double.parseDouble(e2Label);
        rc = Double.compare(e1_double, e2_double);
    } catch (NumberFormatException e) {
        // Use lexical comparison if double comparison is not possible
        //
        rc = e1Label.compareToIgnoreCase(e2Label);
    }
} else {
    rc = e1Label.compareToIgnoreCase(e2Label);
}

Answer 4

不要将值存储为String对象。创建自己的包装器，只为每个String调用一次Double.parseDouble。缓存响应（值或异常）。它也可能缓存一个不区分大小写的字符串版本。

Answer 5

我真的怀疑你能够加速String.compareTo（）这么多。解决方案可能在于更少地使用compareTo（）。但是如果不了解更多关于算法的信息，就无法告诉你如何做到这一点。

Answer 6

即使你可以从compareTo（）中获得更多的性能，我认为主要的问题是列表的大小。即使假设今天您可以将排序延迟减少到可接受的范围（1秒？），如果明年应用程序需要显示两倍大的列表呢？排序算法是O（n log n），因此将列表大小加倍会使排序显着更慢。

要获得强大的解决方案，请查看虚拟表（使用SWT.VIRTUAL属性）。然后，您可以实现不需要预先进行完整排序的基础数据提供程序。具体如何实现它将取决于您的数据来自何处。如果它来自数据库，您可以考虑在所有可排序字段上放置索引。如果没有办法做到这一点，你可以使用其他策略，例如，如果你有一些快速方法将表分成块（例如以“A”开头的行，以“B”开头的行等），那么你可以通过仅提取第一个块中的行，对它们进行排序并显示它们来开始，因为用户始终从表的顶部开始。后续块的排序可以在后台线程中继续。

Answer 7

如果您需要“文字”和“数字”比较，那么这些字符串包含哪些数据？他们总是代表数字吗？

如果它们只包含数字，那么将它们存储为数字可能要快得多（除了做更干净的事情之外）。

如果您需要“文字”比较（我将其解释为在“20”之前排序为“100”），那么您可以轻松地在int或long s上实现该比较，并使用一些数学可能仍然比String比较快得多。

Answer 8

由于renier和Guillaume已经说过String.compareTo（）不应该受到指责。它应该比数字比较慢，但不是那么重要。

即使您的列表长达百万项，也不应超过一秒钟。

如果是一个选项，我会在后台进行搜索，即为字符串附加某种索引。

您应该真正分析最常发生的操作类型，单次插入，质量未排序插入，批量部分排序插入，排序，删除等。

根据最常见的操作，您可以选择更合适的数据结构。

Answer 9

为什么不在开头对列表进行一次排序，并使用插入排序更新？然后，当您想要将顺序从升序更改为降序时，信息已经存在。如果您想按其他列排序，那么只需保留列表，以便切换回此列？或者这在SWT中是不可行的？（自从我使用它已经有一段时间了）

Answer 10

在我看来，你需要做的就是尽可能多地调用String.compareTo（）。基本上有两种方法可以做到这一点。

1）实施某种形式的bucket sort，以避免执行所有这些比较。

根据要排序的字符串数量（数千？百万？），在空间和垃圾收集方面，使用完整的存储桶排序可能需要太多开销。

为了避免你可以执行常数轮的排序，所以字符串被分类到包含所有字符串的列表中，例如前10个字母匹配。然后，您可以使用内置排序对每个存储桶进行排序。

2）对每个字符串进行散列并对散列进行排序（确保处理冲突）。然后你可以重新排序字符串。这可能是最简单的解决方案。

使用这些解决方案中的任何一个都可以让您在不到一秒的时间内对数百万个字符串进行排序。

Answer 11

根据您最近的澄清，这是第二个答案：创建一个类：Item，可用于表示数字或字母数字值，并可以确定是否是 upfront 。这样，您可以避免在compareTo方法调用期间解析值并处理任何异常的开销。

public class Item implements Comparable<Item> {
    private final String s;
    private final double d;
    private final boolean numeric;

    public Item(String s) {
        double tmpD;
        boolean tmpNumeric;

        try {
            // Do the work of parsing / catching exceptions *upfront*.
            tmpD = Double.parseDouble(s);
            tmpNumeric = true;
        } catch(NumberFormatException ex) {
            // Parse failed so must be a String.
            tmpD = 0.0;
            tmpNumeric = false;
        }

        this.s = s;
        this.d = tmpD;
        this.numeric = tmpNumeric;
    }

    public String asString() {
        return s;
    }

    public double asDouble() {
        if (!numeric) {
            throw new IllegalStateException("Not a numeric value: " + s);
        }

        return d;
    }

    public boolean isNumeric() {
        return numeric;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Item)) return false;

        Item item = (Item) o;

        return Double.compare(item.d, d) == 0 && s.equals(item.s);
    }

    @Override
    public int hashCode() {
        int result;
        long temp;
        result = s.hashCode();
        temp = d != +0.0d ? Double.doubleToLongBits(d) : 0L;
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

    public int compareTo(Item item) {
        int ret;

        if (numeric && item.isNumeric()) {
            // Both items are numeric so do fast comparison.
            double diff = d - item.asDouble();
            if (diff > 0.0) {
                ret = 1;
            } else if (diff < 0.0) {
                ret = -1;
            } else {
                ret = 0;
            }
        } else {
            ret = s.compareTo(item.asString());
        }

        return ret;
    }
}

Answer 12

为什么不试试特里？

http://algs4.cs.princeton.edu/52trie/

http://en.wikipedia.org/wiki/Radix_tree

正如罗伯特·塞奇威克（Robert Sedgewick）所写：“命题H.在大小为R的字母表中由N个随机密钥构建的特里结构中检索到的搜索未命中节点的平均数为~logR N”。 [塞奇威克，罗伯特;韦恩，凯文（2011-02-21）。算法（第4版）（Kindle Locations 12674-12676）。培生教育（美国）。 Kindle版。]

比较字符串的最快方法（文字和数字）

12 个答案: