Question

I wrote a method to find the most common character in a string:

public static char getMax(String s) {

char maxappearchar = ' ';
int counter = 0;
int[] charcnt = new int[Character.MAX_VALUE + 1];


for (int i = 0 ; i < s.length() ; i++)
{
    char ch = s.charAt(i);
    // increment this character's cnt and compare it to our max.
    charcnt[ch]++ ;
    if (charcnt[ch] >= counter)
    {
        counter = charcnt[ch];
        maxappearchar = ch;
    } 
}
System.out.println("the max char is   " +maxappearchar + "  and displayed  " +counter+ "  times");
return maxappearchar;
}

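I am asking about two different kinds of solution:

Solution 1 - the fastest code (is that the code I attached?)
Solution 2 - the most memory-efficient code, using fewer arrays and variables

I also wrote a version of my method that uses a HashMap. Is that a better fit for solution 2? If so, why? What are the pros and cons?

Can the attached code be characterized in big-O terms (polynomial, O(log n), ...)? If so, what is its complexity and why?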

Answer 1

This is a fast algorithm that uses a lot of space.

It does not cover full Unicode: there are code points (Unicode characters, held in an int) that need two chars.
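To make the two-char case concrete, here is a small sketch. The class and method names are made up for illustration; only the `String` methods are real API.

```java
final class CodePointDemo {
    // True when at least one code point in s is stored as a surrogate pair,
    // i.e. the string has more UTF-16 chars than code points.
    static boolean hasSurrogatePair(String s) {
        return s.codePointCount(0, s.length()) != s.length();
    }

    public static void main(String[] args) {
        // U+1F600 is outside the Basic Multilingual Plane, so UTF-16 needs two chars for it.
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(hasSurrogatePair(s));             // true
    }
}
```

A per-char counter sees the two halves of such a pair as two unrelated characters, which is why the array approach is only exact for the Basic Multilingual Plane.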

Small optimizations are still possible:

Create additional versions that use byte[] and short[], chosen according to s.length().

Keep length() in a variable:

for (int i = 0, n = s.length(); i < n; i++)
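A rough sketch of the byte[]/short[] idea, under the assumption that the narrow versions are selected purely by string length (no count can exceed the length, so the narrow counters cannot overflow). The class and helper names are hypothetical, and ties go to whichever character reaches the top count first, which differs slightly from the `>=` in the question.

```java
final class NarrowCounters {
    // Hypothetical dispatcher: pick the narrowest counter type that cannot overflow.
    static char getMax(String s) {
        int n = s.length();
        if (n <= Byte.MAX_VALUE) return maxWithBytes(s, n);    // 64 KiB of counters
        if (n <= Short.MAX_VALUE) return maxWithShorts(s, n);  // 128 KiB of counters
        return maxWithInts(s, n);                              // 256 KiB of counters
    }

    private static char maxWithBytes(String s, int n) {
        byte[] cnt = new byte[Character.MAX_VALUE + 1];
        char best = ' ';
        int bestCount = 0;
        for (int i = 0; i < n; i++) {
            char ch = s.charAt(i);
            if (++cnt[ch] > bestCount) {
                bestCount = cnt[ch];
                best = ch;
            }
        }
        return best;
    }

    private static char maxWithShorts(String s, int n) {
        short[] cnt = new short[Character.MAX_VALUE + 1];
        char best = ' ';
        int bestCount = 0;
        for (int i = 0; i < n; i++) {
            char ch = s.charAt(i);
            if (++cnt[ch] > bestCount) {
                bestCount = cnt[ch];
                best = ch;
            }
        }
        return best;
    }

    private static char maxWithInts(String s, int n) {
        int[] cnt = new int[Character.MAX_VALUE + 1];
        char best = ' ';
        int bestCount = 0;
        for (int i = 0; i < n; i++) {
            char ch = s.charAt(i);
            if (++cnt[ch] > bestCount) {
                bestCount = cnt[ch];
                best = ch;
            }
        }
        return best;
    }
}
```

The three bodies are deliberately identical apart from the array type; Java has no generics over primitives, so the duplication is the price of the smaller footprint.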

And yes, a HashMap is probably the most "sensible" solution.
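For comparison, a minimal sketch of what a HashMap-based version might look like. This is not the asker's code, which was not posted. It assumes Java 8 for `Map.merge`, and the class name is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class HashMapCounter {
    static char getMax(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        char best = ' ';
        int bestCount = 0;
        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            // merge stores 1 for a new key, otherwise adds 1 to the existing count
            int c = counts.merge(ch, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = ch;
            }
        }
        return best;
    }
}
```

Memory grows with the number of distinct characters rather than being fixed at 65536 counters, at the cost of boxing and hashing on every character.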

Now, with Java 8, you could go parallel and use multiple cores. It is not worth the effort.

int mostFrequentCodePoint = s.codePoints()
    ...
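The rest of that pipeline is elided above. One possible shape, which is my own assumption and not necessarily what was intended, groups the code points and picks the entry with the highest count. The class name and the -1 result for an empty string are choices made for this sketch.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CodePointFrequency {
    // Returns the most frequent code point, or -1 for an empty string.
    static int mostFrequentCodePoint(String s) {
        return s.codePoints()
                .boxed()
                // code point -> number of occurrences
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(-1);
    }
}
```

Because it iterates code points rather than chars, this version counts a surrogate pair as one character. Adding `.parallel()` after `codePoints()` would be the multi-core variant mentioned above.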

For frequency analysis of natural language, it is probably enough to limit the length of the string to 1000 or so.

Answer 2

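The fastest way to do this is to count the occurrences of every character, then take the maximum value in the count array. If your string is long, you will get a decent speedup from not tracking the current max while looping over the characters in the string.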
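A minimal sketch of the count-then-scan approach, assuming a full 65536-entry table. The class name is made up; ties go to the lowest char value, and an empty string yields '\0'.

```java
final class CountThenScan {
    static char getMax(String s) {
        int[] counts = new int[Character.MAX_VALUE + 1];
        // Pass 1: count only, with no max tracking inside the hot loop.
        for (int i = 0, n = s.length(); i < n; i++) {
            counts[s.charAt(i)]++;
        }
        // Pass 2: a single scan of the table finds the largest count.
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return (char) best;
    }
}
```

The second pass costs a fixed 65536 reads regardless of the input, so it only pays off once the string is much longer than the table.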

See How to count frequency of characters in a string? for many other ideas about how to count frequencies.

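If your strings are mostly ASCII, a branch in the count loop that chooses between an array for the low 128 char values and a HashMap for the rest should be worth it. The branch will predict well if your strings have no non-ASCII characters. If there is a lot of alternation between ASCII and non-ASCII, the branch might hurt a bit compared to using a HashMap for everything.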

public static char getMax(String s) {

    char maxappearchar = ' ';
    int counter = 0;
    int[] ascii_count = new int[128];  // fast path for ASCII
    HashMap<Character,Integer> nonascii_count = new HashMap<Character,Integer>();

    for (int i = 0 ; i < s.length() ; i++)
    {
        char ch = s.charAt(i);  // This does appear to be the recommended way to iterate over a String
        // alternatively, iterate over 32bit Unicode codepoints, not UTF-16 chars, if that matters.
        if (ch < 128) {
            ascii_count[ch]++;
        } else {
            // some code to set or increment the nonascii_count[ch];
        }
    }

    // loop over ascii_count and find the highest element
    // loop over the keys in nonascii_count, and see if any of them are even higher.
    return maxappearchar;
}

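I didn't flesh out the code because I don't do much Java, so IDK whether there is a container that can do the insert-1-or-increment operation more efficiently than a HashMap get and put pair. https://stackoverflow.com/a/6712620/224132 suggests Guava's Multiset<Character>, which looks good.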
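One way the placeholders in the sketch could be filled in, assuming Java 8, where `Map.merge` does the insert-1-or-increment step in a single call. The class name is made up, and whether merge is actually faster than a get and put pair is not something this sketch measures.

```java
import java.util.HashMap;
import java.util.Map;

final class AsciiFastPath {
    static char getMax(String s) {
        char maxappearchar = ' ';
        int counter = 0;
        int[] asciiCount = new int[128];                          // fast path for ASCII
        Map<Character, Integer> nonAsciiCount = new HashMap<>();  // everything else

        for (int i = 0, n = s.length(); i < n; i++) {
            char ch = s.charAt(i);
            if (ch < 128) {
                asciiCount[ch]++;
            } else {
                // stores 1 for a new key, otherwise adds 1 to the existing count
                nonAsciiCount.merge(ch, 1, Integer::sum);
            }
        }

        // Find the highest ASCII count.
        for (int c = 0; c < asciiCount.length; c++) {
            if (asciiCount[c] > counter) {
                counter = asciiCount[c];
                maxappearchar = (char) c;
            }
        }
        // See whether any non-ASCII character is even higher.
        for (Map.Entry<Character, Integer> e : nonAsciiCount.entrySet()) {
            if (e.getValue() > counter) {
                counter = e.getValue();
                maxappearchar = e.getKey();
            }
        }
        return maxappearchar;
    }
}
```

With strict `>` comparisons, an ASCII character wins a tie against a non-ASCII one, and an empty string returns the initial ' '.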

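This may do better than your 2^16-entry charcnt[Character.MAX_VALUE] array. However, if you only ever touch the low 128 elements of that array, most of its memory may never be touched. Memory that is allocated but never touched doesn't really hurt, or use up RAM / swap.

However, looping over all 65536 entries at the end means at least reading them, so the OS has to soft page-fault that memory in and wire it up, and it pollutes the caches. So actually, updating the max on every character might be the better choice. Microbenchmarks might show that iterating over the string and then looping over the count array wins, but that would not account for the cache / TLB pollution from touching that much memory that isn't really needed.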

Answer 3

Using the solution above for ASCII, returning a SimpleEntry<Character,Integer> (full implementation):

// needs: import java.util.AbstractMap; import java.util.Map;
public static Map.Entry<Character,Integer> getMostCommonChar(String phrase) {
    if (phrase == null || phrase.isEmpty()) {
        throw new IllegalArgumentException("input phrase must have non-empty value.");
    }

    char maxchar = ' ';
    int counter = 0;
    // one counter per possible char value; the + 1 keeps '\uFFFF' in bounds
    int[] ascii_count = new int[Character.MAX_VALUE + 1];

    for (int i = 0; i < phrase.length(); i++) {
        char ch = phrase.charAt(i);  // This does appear to be the recommended way to iterate over a String
        if (ascii_count[ch]++ >= counter) {
            counter = ascii_count[ch];
            maxchar = ch;
        }
    }

    Map.Entry<Character,Integer> e = new AbstractMap.SimpleEntry<>(maxchar,counter);

    System.out.println(e.getKey());
    System.out.println(e.getValue());

    return e;
}

Answer 4

public class HelloWorld {

    public static void main(String[] args) {

        String word = "Ferrari";

        String mostUsedChar = "";
        int count = 0;

        String[] array = word.split("");

        for (int i = 0; i < array.length; i++) {
            int tempCount = 0;

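            // count how often array[i] occurs in the whole word (O(n^2) overall)
            for (int j = 0; j < array.length; j++)
            {
                if (array[i].equals(array[j])) {
                    tempCount++;
                }
            }
            // compare once per candidate instead of on every inner iteration
            if (tempCount > count) {
                count = tempCount;
                mostUsedChar = array[i];
            }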
        }
        System.out.println(count + " Most Used Char: " + mostUsedChar);
    }
}

A more efficient way to find the most common character in a string

4 answers: