Question

I have a 20 GB folder containing 358 txt files with a total of 733,019,372 lines. All of the txt files have the following format:

77 clueweb12-0211wb-83-00000
88 clueweb12-0211wb-83-00001
82 clueweb12-0211wb-83-00002
82 clueweb12-0211wb-83-00003
64 clueweb12-0211wb-83-00004
80 clueweb12-0211wb-83-00005
83 clueweb12-0211wb-83-00006
75 clueweb12-0211wb-83-00007

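My goal is to have the program walk the folder recursively, read each txt file line by line, split each line into two parts (e.g. 88 and clueweb12-0211wb-83-00001), and put those parts into a LinkedHashMap<String, List<String>>. In addition, it takes docIDs from the user as an argument (e.g. clueweb12-0211wb-83-00006) and prints the score that belongs to each docID (83). If a docID does not exist, it should return -1 as the score. For example: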

clueweb12-0003wb-22-11553,foo,clueweb12-0109wb-78-15059,bar,clueweb12-0302wb-50-22339

should print: 84,-1,19,-1,79

I get the folder path from the user as an argument.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.*;


import static java.nio.file.FileVisitResult.CONTINUE;



public class App extends SimpleFileVisitor<Path>{

    public LinkedHashMap<String, List<String>> list = new LinkedHashMap<>(); // Put there scores and docIds

    @Override
    public FileVisitResult visitFile(Path path, BasicFileAttributes attr) throws IOException {

     File file =  new File(path.toString());
     BufferedReader br = new BufferedReader(new FileReader(file));
     String line;

    while((line = br.readLine()) != null){


        if(list.containsKey(line.split(" ")[0])){
            list.get(line.split(" ")[0]).add(line.split(" ")[1]);
        }
        else{
            list.put(line.split(" ")[0],new ArrayList(Arrays.asList(line.split(" ")[1])));
        }

    }
        return CONTINUE;
    }


    public static void main(String args[]) throws IOException {




        if (args.length < 2) {
            System.err.println("Usage: java App spamDir docIDs ...");
            return;
        }
        Path spamDir = Paths.get(args[0]);
        String[] docIDs = args[1].split(",");

        App ap = new App();
        Files.walkFileTree(spamDir, ap);
        ArrayList scores = new ArrayList(); // keep scores in that list

        //Search the Lists in LinkedHashMap
        for(int j=0; j<docIDs.length; j++){
            Set set = ap.list.entrySet();
            Iterator i = set.iterator();
            int counter = 0;
            while(i.hasNext()){

                // if LinkedHashMap has the docID add it to scores List
                Map.Entry me = (Map.Entry) i.next();
                ArrayList searchList = (ArrayList) me.getValue();
                if(searchList.contains(docIDs[j])){
                    scores.add(me.getKey());
                    counter++;
                    break;


                }
                else {

                    continue;
                }

            }
            // if LinkedHashMap has not the docId add -1 to scores List
            if(counter == 0){
                scores.add("-1");
            }

        }

        String joined = String.join("," , scores);
        System.out.println(joined);

    }
}

But I ran into this problem:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3181)
    at java.util.ArrayList.grow(ArrayList.java:261)
    at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
    at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
    at java.util.ArrayList.add(ArrayList.java:458)
    at ceng.bim208.App.visitFile(App.java:35)
    at ceng.bim208.App.visitFile(App.java:18)
    at java.nio.file.Files.walkFileTree(Files.java:2670)
    at java.nio.file.Files.walkFileTree(Files.java:2742)
    at ceng.bim208.App.main(App.java:58)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

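I tried increasing the heap size with -Xmx2048M, but that did not solve my problem. What should I do?

Also, if I run the program on a different path (containing 2 txt files in the same format), it works fine.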

Answer 1

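This sounds like an interview or homework assignment. I would bet the task here is to make the mental leap of separating the data you must keep in memory from the data you can store in an index while parsing the files.

No matter how much memory you have, you will eventually run out of it if you keep doing it this way. In your case, a few tips will help you solve this: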

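Do not put everything in memory. If you can, build an index: a separate file that contains only the necessary data.
Process the files as streams: parse the InputStream line by line, so that you never have to hold the whole content in memory.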

In your case:

public LinkedHashMap<String, List<String>> list

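fills the memory with the parsed Strings. As I understand it, you do not need to store the Strings themselves, only the scores. If you clarify what your task is, I can help further, but at the moment it is not clear what you are trying to do.

"My task is to take docIDs as command-line arguments and print out their scores."

What you need is a lookup for the scores: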

Map<String, Map<Integer, Integer>> docIdsWithScoresAndCounts;

or

Map<String, List<Integer>> docIdsWithScores;

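depending on whether you want to count how often each score occurs. The outer Map holds the document ID as the key, and the inner map is itself a score -> count lookup. This is a tricky variation of the counting sort algorithm: you only need to keep track of the document IDs and the scores per document ID, and because the range of scores is limited (how many digits can they have?) the memory consumption per document ID stays bounded. The rest of the data can be thrown away.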
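As a minimal sketch of the first variant, the nested map can be filled with computeIfAbsent and merge. The class name ScoreCounts and its methods add and count are hypothetical names chosen for this illustration; they are not part of the question or the answer.

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreCounts {

    // docId -> (score -> number of times that score was seen for the docId)
    private final Map<String, Map<Integer, Integer>> docIdsWithScoresAndCounts = new HashMap<>();

    // Records one parsed "<score> <docId>" line.
    public void add(String docId, int score) {
        docIdsWithScoresAndCounts
                .computeIfAbsent(docId, id -> new HashMap<>()) // create the inner map on first use
                .merge(score, 1, Integer::sum);                // increment the count for this score
    }

    // Returns how often the score was recorded for the docId, 0 if never.
    public int count(String docId, int score) {
        Map<Integer, Integer> counts = docIdsWithScoresAndCounts.get(docId);
        return counts == null ? 0 : counts.getOrDefault(score, 0);
    }

    public static void main(String[] args) {
        ScoreCounts counts = new ScoreCounts();
        counts.add("clueweb12-0211wb-83-00001", 88);
        counts.add("clueweb12-0211wb-83-00001", 88);
        counts.add("clueweb12-0211wb-83-00001", 64);
        System.out.println(counts.count("clueweb12-0211wb-83-00001", 88)); // prints 2
    }
}
```

Each inner map holds at most one entry per distinct score, so its size depends on the range of scores and not on the number of lines read.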

Note that you only need to store the keys for the document IDs you are interested in. You can throw away the rest.
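Put together, the idea could look roughly like the sketch below: the requested docIDs are placed in a Set first, every file is streamed line by line, and only the scores of requested docIDs are kept. The names ScoreLookup, collect and format are hypothetical. The sketch also assumes that the files are UTF-8 (or plain ASCII) text and that, if a docID occurs more than once, the last score seen wins.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

public class ScoreLookup {

    // Scans lines of the form "<score> <docId>" and records the score of every
    // docId contained in `wanted`. All other lines are discarded immediately.
    public static void collect(Stream<String> lines, Set<String> wanted, Map<String, String> found) {
        lines.forEach(line -> {
            int space = line.indexOf(' ');
            if (space < 0) {
                return; // skip malformed lines
            }
            String docId = line.substring(space + 1);
            if (wanted.contains(docId)) {
                found.put(docId, line.substring(0, space));
            }
        });
    }

    // Builds the output in the order the docIds were requested, with -1 for misses.
    public static String format(List<String> docIds, Map<String, String> found) {
        List<String> scores = new ArrayList<>();
        for (String docId : docIds) {
            scores.add(found.getOrDefault(docId, "-1"));
        }
        return String.join(",", scores);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: java ScoreLookup spamDir docIDs");
            return;
        }
        List<String> docIds = Arrays.asList(args[1].split(","));
        Set<String> wanted = new HashSet<>(docIds);
        Map<String, String> found = new HashMap<>();

        // Walk the directory tree; both streams are closed by try-with-resources.
        try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
            paths.filter(Files::isRegularFile).forEach(path -> {
                try (Stream<String> lines = Files.lines(path)) {
                    collect(lines, wanted, found);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        System.out.println(format(docIds, found));
    }
}
```

With this layout the map never grows beyond the number of docIDs passed on the command line, regardless of how many lines the files contain.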

Answer 2

Here is a new approach that also corrects a few errors.

public Map<String, List<String>> map = new HashMap<>();

@Override
public FileVisitResult visitFile(Path path, BasicFileAttributes attr)
        throws IOException {

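    // Files.lines keeps the file open until the stream is closed, so use
    // try-with-resources (needs: import java.util.stream.Stream).
    try (Stream<String> lines = Files.lines(path)) {
        lines.forEach(line -> {
            String[] keyValue = line.split(" ", 2); // split only once
            map.compute(keyValue[0],
                 (key, oldList) -> {
                      List<String> list = oldList == null
                          ? new ArrayList<>()
                          : oldList;
                      list.add(keyValue[1]);
                      return list;
                 });
        });
    }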
    return CONTINUE;
}

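LinkedHashMap maintains insertion order, which costs memory unnecessarily.
The split should be done only once.
The file should be closed. I used Files.lines, which allows concise code.
(Hint) The original FileReader was given no Charset (encoding), so it used the platform default; Files.lines(path) without a Charset decodes the file as UTF-8. One might consider making the charset a parameter.
Map.compute is a convenient way to decide, based on the old value (the list), whether a new one has to be created.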

Memory can be saved by not storing a List<String> but storing the values as bytes instead, e.g. in a List<byte[]>:

byte[] bytes = keyValue[1].getBytes(Charset.defaultCharset());
String s = new String(bytes, Charset.defaultCharset());

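For strings of plain ASCII, this saves half of the bytes compared with a String on Java 8 and earlier (a char is two bytes). Since Java 9, compact strings already store Latin-1 text with one byte per character, so the saving is smaller there.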
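A small round-trip sketch of this idea follows. It assumes the docIDs are pure ASCII and therefore uses StandardCharsets.US_ASCII in place of the platform default charset shown above; the class name ByteIds and its helper methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteIds {

    // Encodes an ASCII docId as one byte per character.
    public static byte[] encode(String docId) {
        return docId.getBytes(StandardCharsets.US_ASCII);
    }

    // Decodes the bytes back into a String when the value is needed.
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // List.contains compares arrays by identity, so search with Arrays.equals.
    public static boolean contains(List<byte[]> ids, String docId) {
        byte[] wanted = encode(docId);
        for (byte[] id : ids) {
            if (Arrays.equals(id, wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<byte[]> ids = new ArrayList<>();
        ids.add(encode("clueweb12-0211wb-83-00006"));
        System.out.println(ids.get(0).length);                          // prints 25
        System.out.println(decode(ids.get(0)));                         // prints clueweb12-0211wb-83-00006
        System.out.println(contains(ids, "clueweb12-0211wb-83-00006")); // prints true
    }
}
```

Note that a byte[] cannot serve directly as a HashMap key or be found with List.contains, because arrays do not override equals and hashCode; that is the reason for the explicit Arrays.equals loop.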

A database, for example an embedded Java database such as Derby or H2, would probably do a better job.

GC overhead limit exceeded when reading huge txt files
