Question

I have a huge CSV file (500 MB) with 400k records:

id, name, comment, text
1, Alex, Hello, I believe in you

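The text column contains a lot of information and sentences. I want to take this column ("text"), replace every non-alphabetic character with " ", and then list the words that occur in it from most frequent to least frequent, limited to the 1000 most common. This is what my code looks like. I am using the CsvReader library: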

CsvReader doc = new CsvReader("My CSV Name");
        doc.readHeaders();
        try {
            List<String> listWords = new ArrayList<>();
            while (doc.readRecord()) {
                listWords.addAll(Arrays.asList(doc.get("Text"/*my column name*/).replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+")));
            }

            Map<String, Long> sortedText = listWords.stream()
                    .collect(groupingBy(chr -> chr, counting()))
                    .entrySet().stream()
                    .sorted(Map.Entry.comparingByValue(Collections.reverseOrder()))
                    .limit(1000)
                    .collect(Collectors.toMap(
                            Map.Entry::getKey,
                            Map.Entry::getValue,
                            (e1, e2) -> e1,
                            LinkedHashMap::new
                    ));
            sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));
            doc.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            doc.close();
        }

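After running this, I get a GC out-of-memory error. What is the best way to do this? I cannot increase the heap size; I have to use the default settings.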

Answer 1

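A suggestion for this problem: instead of adding every word to listWords, keep a running count of the words as each CSV row is processed. That way memory holds one entry per distinct word rather than one entry per word occurrence.

The code would look like this: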

CsvReader doc = null;

try {
    doc = new CsvReader("My CSV Name");
    doc.readHeaders();

    // Running counts: one entry per distinct word, updated row by row.
    Map<String, Long> mostFrequent = new HashMap<>();

    while (doc.readRecord()) {
        String[] words = doc.get("text"/*my column name*/)
                .replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+");
        for (String word : words) {
            // A row without any letters yields a single empty token; skip it.
            if (!word.isEmpty()) {
                mostFrequent.merge(word, 1L, Long::sum);
            }
        }
    }

    // Sort by count, descending, and keep the 1000 most frequent words in that order.
    Map<String, Long> sortedText = mostFrequent.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(1000)
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (e1, e2) -> e1, LinkedHashMap::new));

    sortedText.forEach((k, v) -> System.out.println("Word: " + k + " || " + "Count: " + v));

} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close exactly once, and only if the reader was actually opened.
    if (doc != null) {
        doc.close();
    }
}
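
If sorting every distinct word just to keep 1000 of them is a concern, the selection step could also be done with a bounded min-heap. The following is only a sketch of that idea, not part of the original answer: the class name WordFrequency and the helpers addWords and topN are hypothetical names made up for illustration, and it works on plain strings so it can be tried without the CsvReader library. Note that the map of distinct words still has to fit in the default heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class for illustration; not part of the CsvReader library.
final class WordFrequency {

    // Adds the words of one text value to the running counts.
    static void addWords(String text, Map<String, Long> counts) {
        String[] words = text.replaceAll("\\P{Alpha}", " ").toLowerCase().trim().split("[ ]+");
        for (String word : words) {
            if (!word.isEmpty()) { // text without letters yields one empty token
                counts.merge(word, 1L, Long::sum);
            }
        }
    }

    // Returns the n most frequent entries, most frequent first.
    static Map<String, Long> topN(Map<String, Long> counts, int n) {
        // Min-heap on count: the root is the weakest of the current candidates.
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>(Map.Entry.<String, Long>comparingByValue());
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the least frequent candidate
            }
        }
        // Only the surviving (at most n) entries are sorted.
        List<Map.Entry<String, Long>> top = new ArrayList<>(heap);
        top.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : top) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        // Stand-in for the values of the "text" column, fed one row at a time.
        for (String text : Arrays.asList("Hello, I believe in you", "I believe you, Alex!")) {
            addWords(text, counts);
        }
        topN(counts, 3).forEach((k, v) -> System.out.println("Word: " + k + " || Count: " + v));
    }
}
```

In the CSV loop, addWords(doc.get("text"), counts) would be called once per record, and topN(counts, 1000) once at the end. Words with equal counts come out in no particular order relative to each other.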

OutOfMemoryError when parsing CSV
