What is the fastest way to check whether one string contains the characters of another string?

Date: 2015-10-03 14:59:43

Tags: java arrays string

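I am given two strings a and b and need to check whether b contains the exact characters of a, counting repeats. For example, "ABBA" and "BAAAAA" returns false, while "ABBA" and "ABABAB" returns true. I split a into an array of one-character strings and, for each element, check whether b contains it. If it does, I remove that occurrence from b so that it cannot be found twice.

However, this method is too slow: it apparently takes about 12 seconds on some large strings. I have been trying, but I have not found a faster solution. Please help me if you can!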

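public static boolean inneholdt(String a, String b)
{
    int k = 0; // number of characters of a that were matched in b
    String[] Inn = a.split("(?!^)"); // split a into one-character strings

    for (int i = 0; i < Inn.length; i++)
    {
        if(b.contains(Inn[i]))
        {
            // Remove the matched occurrence so that it is not counted twice.
            // Note: replaceFirst treats its first argument as a regex and
            // builds a new string on every call.
            b = b.replaceFirst(Inn[i], "");

            k++;
        }
    }

    // true only if every character of a was matched
    return k >= Inn.length;
}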

2 answers:

Answer 0 (score: 2)

If I understood the problem correctly, there are two main ways to do this:

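  • Sort the char arrays of both strings and check that the sorted shortest array is a subsequence of the sorted longest array (a plain prefix test is not enough: "AABB" is not a prefix of "AAABBB")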

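  • Fill a Map<Character, Integer> with the number of occurrences of each character in the shortest string. Then iterate over the longest string and decrement the count of each character encountered. When a counter reaches 0, remove it from the map. If the map becomes empty, return true. If the map is still not empty after all characters of the longest string have been consumed, return false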

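The first will take more time on long strings and is more likely to use a lot of memory, because the second solution "compresses" redundancy, whereas the sorted arrays may contain long runs of the same character. However, the first is easy to write, read and understand, so if you don't need crazy performance, it is fine.

I will show you the code for the second solution: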
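For comparison, here is a minimal sketch of the sorting approach, which the answer describes but does not implement. The class name SortedContainment and the parameter names small and large are hypothetical. The sketch walks both sorted arrays with two indices rather than testing for a prefix, since a prefix test alone would reject "ABBA" against "ABABAB".

```java
import java.util.Arrays;

// Hypothetical helper; the class and method names are illustrative only.
class SortedContainment {

    // Returns true if every character of `small` occurs in `large`
    // at least as many times as it occurs in `small`.
    static boolean containsAllChars(String small, String large) {
        if (small.length() > large.length()) return false;

        char[] s = small.toCharArray();
        char[] l = large.toCharArray();
        Arrays.sort(s);
        Arrays.sort(l);

        int i = 0; // index into the sorted shorter array
        for (int j = 0; j < l.length && i < s.length; j++) {
            if (l[j] == s[i]) {
                i++;              // matched one occurrence
            } else if (l[j] > s[i]) {
                return false;     // s[i] can no longer appear in l
            }
            // l[j] < s[i]: surplus character in the longer string, skip it
        }
        return i == s.length;     // true only if everything was matched
    }
}
```

Calling SortedContainment.containsAllChars("ABBA", "ABABAB") walks "AABB" against "AAABBB" and matches all four characters, while "ABBA" against "BAAAAA" runs out of Bs. Both arrays are copied and sorted, which is where the extra time and memory mentioned above come from.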

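// In Java 8. For older versions it is also easy, but more verbose.
// Note the argument order: this checks that all characters of b
// (the shorter string) are contained in a (the longer string).
import java.util.HashMap;
import java.util.Map;

public static boolean inneholdt(String a, String b) {
    if (b.isEmpty()) return true; // nothing to look for
    if (b.length() > a.length()) return false;

    // Count the occurrences of each character of the shorter string
    Map<Character, Integer> countChars = new HashMap<>();
    for (char ch : b.toCharArray()) countChars.put(ch, countChars.getOrDefault(ch, 0) + 1);
    // Consume the longer string, decrementing the counters as characters are found
    for (char ch : a.toCharArray()) {
        Integer count = countChars.get(ch);
        if (count != null) {
            if (count == 1) countChars.remove(ch);
            else            countChars.put(ch, count - 1);
        }
        if (countChars.isEmpty()) return true; // every character of b was found
    }
    return false;
}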

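Note that this solution is optimized so that, in the best case, the execution time depends on the shortest string. If the characters of b are contained in a, we will most likely not iterate over all the characters of a. This solution works very well when b is much shorter than a, because in that case it is very likely that b is contained in a.

In response to Max's benchmark, I tried comparing the performance myself. Here is what I found: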
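The early-exit idea can also be sketched with a plain int array in place of the HashMap. This is an illustrative variant, not the code from the answer: the class name ArrayCounting and the parameter names small and large are hypothetical, and it assumes that counting UTF-16 code units (Java char values) is good enough for the input.

```java
// Hypothetical variant; the class and method names are illustrative only.
class ArrayCounting {

    // Returns true if `large` contains every character of `small`,
    // counting repeated characters.
    static boolean containsAllChars(String small, String large) {
        if (small.isEmpty()) return true;
        if (small.length() > large.length()) return false;

        // One counter per possible UTF-16 code unit.
        int[] needed = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < small.length(); i++) {
            needed[small.charAt(i)]++;
        }

        int remaining = small.length(); // occurrences still to be matched
        for (int i = 0; i < large.length(); i++) {
            char ch = large.charAt(i);
            if (needed[ch] > 0) {
                needed[ch]--;
                if (--remaining == 0) return true; // stop as soon as all are found
            }
        }
        return false;
    }
}
```

The single remaining counter plays the role of the map's isEmpty() check: the loop over the longer string stops as soon as the last outstanding character has been seen. The trade-off is a fixed 65536-entry array per call, which avoids boxing but is wasteful for very short inputs.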

My version :
Mean time of 3 ms with lenA = 50000, lenB = 50
Mean time of 1 ms with lenA = 50000, lenB = 500
Mean time of 1 ms with lenA = 50000, lenB = 5000
Mean time of 1 ms with lenA = 50000, lenB = 50000
Mean time of 10 ms with lenA = 5000000, lenB = 5000
Mean time of 18 ms with lenA = 5000000, lenB = 50000
Mean time of 93 ms with lenA = 5000000, lenB = 500000
Mean time of 519 ms with lenA = 5000000, lenB = 5000000
Mean time of 75 ms with lenA = 50000000, lenB = 50000
Mean time of 149 ms with lenA = 50000000, lenB = 500000
Mean time of 674 ms with lenA = 50000000, lenB = 5000000
Mean time of 9490 ms with lenA = 50000000, lenB = 50000000

Max's parallel solution :
Mean time of 89 ms with lenA = 50000, lenB = 50
Mean time of 22 ms with lenA = 50000, lenB = 500
Mean time of 23 ms with lenA = 50000, lenB = 5000
Mean time of 36 ms with lenA = 50000, lenB = 50000
Mean time of 2962 ms with lenA = 5000000, lenB = 5000
Mean time of 2021 ms with lenA = 5000000, lenB = 50000
Mean time of 2200 ms with lenA = 5000000, lenB = 500000
Mean time of 3988 ms with lenA = 5000000, lenB = 5000000
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.ForkJoinTask.recordExceptionalCompletion(Unknown Source)
    at java.util.concurrent.CountedCompleter.internalPropagateException(Unknown Source)
    at java.util.concurrent.ForkJoinTask.setExceptionalCompletion(Unknown Source)
    at java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.util.concurrent.ForkJoinPool$WorkQueue.pollAndExecCC(Unknown Source)
    at java.util.concurrent.ForkJoinPool.externalHelpComplete(Unknown Source)
    at java.util.concurrent.ForkJoinTask.externalAwaitDone(Unknown Source)
    at java.util.concurrent.ForkJoinTask.doInvoke(Unknown Source)
    at java.util.concurrent.ForkJoinTask.invoke(Unknown Source)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(Unknown Source)
    at java.util.stream.AbstractPipeline.evaluate(Unknown Source)
    at java.util.stream.ReferencePipeline.collect(Unknown Source)
    at Stack.inneholdt(Stack.java:34)
    at Stack.test(Stack.java:61)
    at Stack.main(Stack.java:12)

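This shows that the implementation of Collectors.groupingBy is very memory-hungry. Max's solution is not bad in principle, even though it does more work than necessary. It is a high-level solution, so it has no control over some implementation details, such as the way the records are grouped. Looking at the code of the Java standard library, it looks like it performs a sort, so it needs some memory, especially since several threads are sorting at the same time. I ran this with the default setting of -Xmx2g, then ran it again with -Xmx4g.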

My version :
Mean time of 3 ms with lenA = 50000, lenB = 50
Mean time of 1 ms with lenA = 50000, lenB = 500
Mean time of 2 ms with lenA = 50000, lenB = 5000
Mean time of 5 ms with lenA = 50000, lenB = 50000
Mean time of 7 ms with lenA = 5000000, lenB = 5000
Mean time of 17 ms with lenA = 5000000, lenB = 50000
Mean time of 93 ms with lenA = 5000000, lenB = 500000
Mean time of 642 ms with lenA = 5000000, lenB = 5000000
Mean time of 64 ms with lenA = 50000000, lenB = 50000
Mean time of 161 ms with lenA = 50000000, lenB = 500000
Mean time of 836 ms with lenA = 50000000, lenB = 5000000
Mean time of 11962 ms with lenA = 50000000, lenB = 50000000

Max's parallel solution :
Mean time of 45 ms with lenA = 50000, lenB = 50
Mean time of 18 ms with lenA = 50000, lenB = 500
Mean time of 19 ms with lenA = 50000, lenB = 5000
Mean time of 35 ms with lenA = 50000, lenB = 50000
Mean time of 1691 ms with lenA = 5000000, lenB = 5000
Mean time of 1162 ms with lenA = 5000000, lenB = 50000
Mean time of 1817 ms with lenA = 5000000, lenB = 500000
Mean time of 1671 ms with lenA = 5000000, lenB = 5000000
Mean time of 12052 ms with lenA = 50000000, lenB = 50000
Mean time of 10034 ms with lenA = 50000000, lenB = 500000
Mean time of 9467 ms with lenA = 50000000, lenB = 5000000
Mean time of 18122 ms with lenA = 50000000, lenB = 50000000

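This time it ran fine, but it is still slow. Note that the test cases differ from one version to the other, so this is a rather poor benchmark, but I think it is enough to show that Collectors.groupingBy is very memory-hungry and that not returning as early as possible is a big drawback.

The code is available here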

Answer 1 (score: 2)

Java 8 + lambda expressions.

import java.util.Map;
import java.util.stream.Collectors;

public static boolean inneholdt(String a, String b) {
    // Here we are counting occurrences of characters in the string
    Map<Integer, Long> aCounted = a.chars().parallel().boxed().collect(Collectors.groupingBy(o -> o, Collectors.counting()));
    Map<Integer, Long> bCounted = b.chars().parallel().boxed().collect(Collectors.groupingBy(o -> o, Collectors.counting()));

    // Now we're checking if the second string contains all the characters from the first.
    // Iterate over the keys of a, so that a character missing from b makes the check fail.
    return aCounted.keySet().parallelStream().allMatch(
            x -> bCounted.getOrDefault(x, 0L) >= aCounted.getOrDefault(x, 0L)
    );
}

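How does it work?

  1. Count the occurrences of each character in both strings.
  2. Check that the second string has at least as many occurrences of each character as the first.
  3. This solution takes advantage of parallel execution and works with any alphabet, because it counts integer character values (strictly speaking, chars() yields UTF-16 code units; codePoints() would count true code points). However, @Dici's solution will be much faster for the case he describes.

    Also, I was interested in how parallel streams affect performance, so here are some numbers. N is the length of the strings.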

    Alphanumeric N=50000:    26ms   vs 10ms   vs 16ms
    Alphanumeric N=5000000:  425ms  vs 460ms  vs 162ms
    Alphanumeric N=50000000: 3812ms vs 3297ms vs 1933ms
    

    The first number is for @Dici's solution, the second is for my solution with plain streams, and the third is for the version in this answer.
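The plain-stream variant measured above is not shown in the answer. The following is only a sketch of what a sequential version might look like, assuming it simply drops the parallel() calls. It also uses codePoints() rather than chars(), so that characters outside the Basic Multilingual Plane are counted once. The class name StreamCounting and the helper countCodePoints are hypothetical.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sequential variant; names are illustrative only.
class StreamCounting {

    // Counts how often each Unicode code point occurs in s.
    static Map<Integer, Long> countCodePoints(String s) {
        return s.codePoints().boxed()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Returns true if b contains every code point of a at least as often as a does.
    static boolean inneholdt(String a, String b) {
        Map<Integer, Long> aCounted = countCodePoints(a);
        Map<Integer, Long> bCounted = countCodePoints(b);
        // Every code point required by a must be available often enough in b.
        return aCounted.entrySet().stream()
                .allMatch(e -> bCounted.getOrDefault(e.getKey(), 0L) >= e.getValue());
    }
}
```

Like the parallel version, this always counts both strings completely before comparing, so it cannot return early the way the map-based loop in the other answer does.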