Efficiently sort a String by the number of occurrences of each character

Time: 2016-09-30 03:40:10

Tags: java string performance sorting

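I am trying to sort a string by the number of occurrences of each character, most frequent first and least frequent last. After sorting, I need to remove all duplicate characters. Since an example is always clearer, the program should do the following: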

String str = "aebbaaahhhhhhaabbbccdfffeegh";
String output = sortByCharacterOccurrencesAndTrim(str);

In this case, the sortByCharacterOccurrencesAndTrim method should return:

String output = "habefcdg"

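If two characters have the same number of occurrences, their order in the returned string does not matter. So "habefcdg" could just as well be "habfecgd", because 'f' and 'e' both occur 3 times, and 'd' and 'g' both occur once.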

"habefcdg" would effectively be the same as "habfecgd"

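Note: I want to point out that performance matters in this case, so I would prefer the most efficient approach. I say this because the string length can range from 1 up to the maximum length (which I think is the same as Integer.MAX_VALUE, but I am not sure), so I want to minimize any potential bottlenecks.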

4 answers:

Answer 0: (score: 5)

"A map and a couple of while loops" is certainly the simplest approach, and it will probably be quite fast. The idea is:

for each character
    increment its count in the map
Sort the map entries by count in descending order
Output the map keys in that order
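The pseudocode above could be sketched in Java roughly as follows. This is only one possible reading of it: the class name MapFrequencySort and the choice of HashMap plus a sorted entry list are assumptions made for illustration, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
public class MapFrequencySort {
    static String sortByCharacterOccurrencesAndTrim(String str) {
        // For each character, increment its count in the map.
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < str.length(); i++) {
            counts.merge(str.charAt(i), 1, Integer::sum);
        }

        // A HashMap cannot be sorted in place, so copy the entries to a list
        // and sort that list by count, highest first.
        List<Map.Entry<Character, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        // Output the keys in that order; each character appears exactly once.
        StringBuilder out = new StringBuilder(entries.size());
        for (Map.Entry<Character, Integer> e : entries) {
            out.append(e.getKey().charValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The order of characters with equal counts depends on map iteration order.
        System.out.println(sortByCharacterOccurrencesAndTrim("aebbaaahhhhhhaabbbccdfffeegh"));
    }
}
```

For the string from the question, the result starts with "hab"; how the ties (e/f and d/g) are ordered is left to the map, which the question allows.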

However, 100,000,000 map lookups could get quite expensive. You can speed this up by creating an array of 65,536 integer counts (or 128 if the input is ASCII only). Then:

for each character
    array[(int)ch] += 1

Then you go through that array and create a map of the characters with non-zero counts:

for i = 0 to 65535
    if array[i] > 0
        map.add((char)i, array[i])

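Then sort that map by count in descending order and output the characters in that order.

This will likely perform quite a bit faster, simply because indexing into an array 100,000,000 times is probably much faster than doing 100,000,000 map lookups.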
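A minimal sketch of the array-based idea follows. Instead of building a map of the non-zero counts, it packs each (count, char) pair into a long and sorts a primitive array. That packing step is an assumption of this sketch, not part of the answer, and the class name ArrayFrequencySort is hypothetical.

```java
import java.util.Arrays;

// Hypothetical class name, used only for this sketch.
public class ArrayFrequencySort {
    static String sortByCharacterOccurrencesAndTrim(String str) {
        // One counter per possible char value (UTF-16 code unit).
        int[] counts = new int[65536];
        for (int i = 0; i < str.length(); i++) {
            counts[str.charAt(i)]++;
        }

        // Count how many distinct characters occurred.
        int distinct = 0;
        for (int c : counts) {
            if (c > 0) distinct++;
        }

        // Pack each non-zero entry as (count << 16) | char, so that sorting
        // the longs orders by count first and by char value second.
        long[] packed = new long[distinct];
        int n = 0;
        for (int ch = 0; ch < counts.length; ch++) {
            if (counts[ch] > 0) {
                packed[n++] = ((long) counts[ch] << 16) | ch;
            }
        }
        Arrays.sort(packed); // ascending

        // Read the sorted array backwards to get descending counts.
        char[] out = new char[distinct];
        for (int i = 0; i < distinct; i++) {
            out[i] = (char) (packed[distinct - 1 - i] & 0xFFFF);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // prints habfecgd
        System.out.println(sortByCharacterOccurrencesAndTrim("aebbaaahhhhhhaabbbccdfffeegh"));
    }
}
```

Because ties are broken by char value in reverse, this variant returns "habfecgd" for the string in the question, which the question states is as acceptable as "habefcdg".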

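Answer 1: (score: 4)

Note: this is not an answer, just test code showing the performance of the answers by Jim Mischel and Óscar López (with parallel streams added in response to a comment by the OP).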

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        String s = buildString();
        System.out.println("buildString: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result1 = testUsingArray(s);
        System.out.println("testUsingArray: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result2 = testUsingMap(s);
        System.out.println("testUsingMap: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result3 = testUsingStream(s);
        System.out.println("testUsingStream: " + (System.currentTimeMillis() - start) + "ms");

        start = System.currentTimeMillis();
        String result4 = testUsingParallelStream(s);
        System.out.println("testUsingParallelStream: " + (System.currentTimeMillis() - start) + "ms");

        System.out.println(result1);
        System.out.println(result2);
        System.out.println(result3);
        System.out.println(result4);
    }
    private static String buildString() {
        Random rnd = new Random();
        char[] buf = new char[100_000_000];
        for (int i = 0; i < buf.length; i++)
            buf[i] = (char)(rnd.nextInt(127 - 33) + 33);
        return new String(buf);
    }
    private static String testUsingArray(String s) {
        int[] count = new int[65536];
        for (int i = 0; i < s.length(); i++)
            count[s.charAt(i)]++;
        List<CharCount> list = new ArrayList<>();
        for (int i = 0; i < 65536; i++)
            if (count[i] != 0)
                list.add(new CharCount((char)i, count[i]));
        Collections.sort(list);
        char[] buf = new char[list.size()];
        for (int i = 0; i < buf.length; i++)
            buf[i] = list.get(i).ch;
        return new String(buf);
    }
    private static String testUsingMap(String s) {
        Map<Character, CharCount> map = new HashMap<>();
        for (int i = 0; i < s.length(); i++)
            map.computeIfAbsent(s.charAt(i), CharCount::new).count++;
        List<CharCount> list = new ArrayList<>(map.values());
        Collections.sort(list);
        char[] buf = new char[list.size()];
        for (int i = 0; i < buf.length; i++)
            buf[i] = list.get(i).ch;
        return new String(buf);
    }
    private static String testUsingStream(String s) {
        int[] output = s.codePoints()
                        .boxed()
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .stream()
                        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                        .mapToInt(Map.Entry::getKey)
                        .toArray();
        return new String(output, 0, output.length);
    }
    private static String testUsingParallelStream(String s) {
        int[] output = s.codePoints()
                        .parallel()
                        .boxed()
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                        .entrySet()
                        .parallelStream()
                        .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                        .mapToInt(Map.Entry::getKey)
                        .toArray();
        return new String(output, 0, output.length);
    }
}
class CharCount implements Comparable<CharCount> {
    final char ch;
    int count;
    CharCount(char ch) {
        this.ch = ch;
    }
    CharCount(char ch, int count) {
        this.ch = ch;
        this.count = count;
    }
    @Override
    public int compareTo(CharCount that) {
        return Integer.compare(that.count, this.count); // descending
    }
}

Sample output

buildString: 974ms
testUsingArray: 48ms
testUsingMap: 216ms
testUsingStream: 1279ms
testUsingParallelStream: 442ms
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
UOMP<FV{KHt`(-q6;Gl'R9nxy+.Y[=2a7^45v?E@e,>|AD_\ILpJ}8sow"Z&bCmNW1$!Sd0c]~g3BjX#fz:Q*Tkui%/r)h
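The figures above come from a single run of each method, timed with System.currentTimeMillis(). If a slightly steadier comparison is wanted, one rough option is to warm each method up and keep the best of several runs. The helper below is a hypothetical sketch (the names RoughTimer and timeBest are made up here); for serious measurements a benchmarking harness such as JMH would be the usual choice.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of the test code above.
public class RoughTimer {
    // Runs the task `warmups` times without measuring, then `runs` times
    // measured, and returns the smallest elapsed time in nanoseconds.
    static <T> long timeBest(Supplier<T> task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.get();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.get();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        String s = "aebbaaahhhhhhaabbbccdfffeegh";
        // Stand-in workload; in the test class this would be e.g. () -> testUsingArray(s).
        long ns = timeBest(() -> s.chars().distinct().count(), 5, 10);
        System.out.println("best of 10: " + ns + "ns");
    }
}
```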

Answer 2: (score: 3)

Just for fun (and I am not claiming this is the most efficient solution): how about some Java 8 lambdas + parallel streams?

public String sortByCharacterOccurrencesAndTrim(String str) {

    // build a frequency map, for each code point store its count    
    Map<Integer, Long> frequencies =
        str.codePoints()
           .parallel()
           .boxed()
           .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    // sort by descending frequency and collect code points into array
    int[] output =
        frequencies.entrySet()
                   .parallelStream()
                   .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                   .mapToInt(Map.Entry::getKey)
                   .toArray();

    // create output string from code point array
    return new String(output, 0, output.length);

}

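If you want a super-efficient solution, you could rewrite the above algorithm using explicit loops, but that is a lot of code and it is late for me :). The idea would be the same, though: build a character frequency map, sort it by frequency in descending order, and build a string from the characters.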
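One way the explicit-loop rewrite might look is sketched below. It keeps the code-point semantics of the stream version, so characters outside the Basic Multilingual Plane are counted as one unit. The class name and the use of a one-element int[] as a mutable counter are assumptions of this sketch, and it makes no claim to being the fastest option.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, used only for this sketch.
public class CodePointFrequencySort {
    static String sortByCharacterOccurrencesAndTrim(String str) {
        // Build the frequency map: code point -> mutable counter.
        Map<Integer, int[]> frequencies = new HashMap<>();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            frequencies.computeIfAbsent(cp, k -> new int[1])[0]++;
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }

        // Sort the entries by descending frequency.
        List<Map.Entry<Integer, int[]>> entries = new ArrayList<>(frequencies.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue()[0], a.getValue()[0]));

        // Build the output string from the code points.
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Integer, int[]> e : entries) {
            out.appendCodePoint(e.getKey());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The order of characters with equal counts is unspecified.
        System.out.println(sortByCharacterOccurrencesAndTrim("aebbaaahhhhhhaabbbccdfffeegh"));
    }
}
```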

Answer 3: (score: -1)

I don't know anything about streams and lambdas, but I would do it like this:

result <- merge(df, labels, by="D")[, union(names(df), names(labels))]

Count the occurrences. After that, it is just a matter of sorting from highest to lowest.