Question

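I have a dataset of more than 800,000 rows in which every even-numbered row is a copy of the odd-numbered row before it. I want to remove the duplicates. Can anyone help?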

Answer 1

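You could try this. It uses buffered reading and writing to process the file line by line, skipping every other line. (I don't have access to a compiler right now to catch small mistakes; if you run into any problems, leave a comment and I'll edit.)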

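import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Charset charset = Charset.forName("US-ASCII"); // change to the file's actual charset
Path toRead = Paths.get("largefile.txt");
Path toWrite = Paths.get("filteredfile.txt");
// Open the writer once, outside the loop. Files.newBufferedWriter truncates the
// target file by default, so reopening it for every line would leave only the
// last line written in the output.
try (BufferedReader reader = Files.newBufferedReader(toRead, charset);
     BufferedWriter writer = Files.newBufferedWriter(toWrite, charset)) {
    String line;
    boolean keep = true; // true on odd lines (1st, 3rd, ...), false on their even copies
    while ((line = reader.readLine()) != null) {
        if (keep) {
            writer.write(line);
            writer.newLine();
        }
        keep = !keep; // alternate between keeping and skipping
    }
} catch (IOException x) {
    System.err.format("IOException: %s%n", x);
}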
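
Skipping by position trusts that every even line really is a copy of the line before it. If that pattern might break somewhere in the file, a more defensive variant is to compare each line with the previous one and drop it only when the two are equal. The sketch below shows that idea; the class and method names (AdjacentDuplicateFilter, dropAdjacentDuplicates, filter) are made up for illustration, and the in-memory filter helper exists only so the example runs without any files. Note the trade-off: if two consecutive original records are legitimately identical, this version collapses them into one, which the positional approach would not do.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class AdjacentDuplicateFilter {

    // Copies lines from reader to writer, skipping any line that is equal to
    // the line immediately before it. Returns the number of lines written.
    static long dropAdjacentDuplicates(BufferedReader reader, BufferedWriter writer)
            throws IOException {
        String previous = null;
        String line;
        long written = 0;
        while ((line = reader.readLine()) != null) {
            if (!line.equals(previous)) {
                writer.write(line);
                writer.newLine();
                written++;
            }
            previous = line;
        }
        writer.flush();
        return written;
    }

    // Hypothetical convenience wrapper for in-memory text, for demonstration only.
    static String filter(String text) {
        StringWriter out = new StringWriter();
        try (BufferedReader reader = new BufferedReader(new StringReader(text));
             BufferedWriter writer = new BufferedWriter(out)) {
            dropAdjacentDuplicates(reader, writer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Each even line repeats the odd line before it.
        System.out.print(filter("a\na\nb\nb\nc\nc\n"));
    }
}
```

For a real file, the same method can be given the reader and writer from Files.newBufferedReader and Files.newBufferedWriter, as in the snippet above. Only one previous line is held in memory at a time, so the file size should not matter.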

Answer 2

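I think you should give more information about this: the programming language, and so on.

My guess is that you should change your query so it doesn't produce duplicates in the first place (even using "distinct" should work).

Please post more information so that we can help you.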
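
The "distinct" suggestion above assumes the rows come from a database query, which the question does not confirm. If the rows are instead already in memory in a Java program (also an assumption), a rough analogue is Stream.distinct(). The sketch below uses a made-up class name, DistinctLines, and a tiny sample list. Unlike skipping every other line, distinct() removes every repeated value, not just adjacent copies, and it has to remember the values it has seen, so it may not suit data where non-adjacent repeats are legitimate.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DistinctLines {

    // Keeps the first occurrence of each value, preserving encounter order.
    static List<String> distinctLines(List<String> lines) {
        return lines.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The trailing "a" is a non-adjacent repeat and is removed as well.
        List<String> rows = Arrays.asList("a", "a", "b", "b", "a");
        System.out.println(distinctLines(rows)); // prints [a, b]
    }
}
```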

Removing duplicates from a large dataset
