Question

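I need advice from people who know Java and memory issues very well. I have large CSV files (about 500 MB each) and I need to merge them with -Xmx set to 64 MB. I have tried several approaches, but nothing works: I always get an out-of-memory exception. What should I do to make it work?

The task is: develop a simple implementation that joins two input tables in a reasonably efficient way and stores both tables in RAM if needed.

My code works, but it needs a lot of memory, so it cannot fit into 64 MB.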

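import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ImprovedInnerJoin {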
public static void main(String[] args) throws IOException {

    RandomAccessFile firstFile     = new RandomAccessFile("input_A.csv", "r");
    FileChannel      firstChannel = firstFile.getChannel();
    RandomAccessFile secondFile     = new RandomAccessFile("input_B.csv", "r");
    FileChannel      secondChannel = secondFile.getChannel();
    RandomAccessFile resultFile     = new RandomAccessFile("result2.csv", "rw");
    FileChannel      resultChannel = resultFile.getChannel().position(0);

    ByteBuffer resultBuffer = ByteBuffer.allocate(40);
    ByteBuffer firstBuffer = ByteBuffer.allocate(25);
    ByteBuffer secondBuffer = ByteBuffer.allocate(25);

    while (secondChannel.position() != secondChannel.size()){
        Map<String, List<String>> table2Part = new HashMap<>();
        for (int i = 0; i < secondChannel.size(); ++i){
            if (secondChannel.read(secondBuffer) == -1)
                break;
            secondBuffer.rewind();
            String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
            if (!table2Part.containsKey(table2Tuple[0]))
                table2Part.put(table2Tuple[0], new ArrayList<>());
            table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
            secondBuffer.clear();
        }

        Set <String> taple2keys = table2Part.keySet();
        while (firstChannel.read(firstBuffer) != -1){
            firstBuffer.rewind();
            String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
            for (String table2key : taple2keys){
                if (table1Tuple[0].equals(table2key)){
                    for (String value : table2Part.get(table2key)){
                        String result = table1Tuple[0] + "," + table1Tuple[1].substring(0,14) + "," + value; // substring(0,14), otherwise the result buffer overflows
                        resultBuffer.put(result.getBytes());
                        resultBuffer.rewind();
                        while(resultBuffer.hasRemaining()){
                            resultChannel.write(resultBuffer);
                        }
                        resultBuffer.clear();
                    }
                }
            }
            firstBuffer.clear();
        }
        firstChannel.position(0);
        table2Part.clear();
    }

    firstChannel.close();
    secondChannel.close();
    resultChannel.close();
    System.out.println("Operation completed.");
}
}

Answer 1

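A very easy to implement version of an external join is the external hash join. It is much easier to implement than an external merge sort join, and it has only one drawback (more on that later).

How does it work?

Very similar to a hash table. Choose a number n, which is the number of files ("buckets") you will distribute your data into.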

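Then do the following:

Set up n file writers.
For each file you want to join, and for each line in it:
- get the hashcode of the key you want to join on
- compute the hashcode modulo n, which gives you k
- append the CSV line to the k-th file writer
Flush/close all n writers.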
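
The partitioning steps above could be sketched roughly as follows. This is only a minimal illustration, assuming the join key is the first comma-separated column and that the files are UTF-8; the class name HashPartitioner, the method names and the bucket file naming are made up for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HashPartitioner {

    /** Bucket index k for a key: non-negative hashcode modulo n. */
    static int bucketOf(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    /**
     * Splits one CSV file into n bucket files named prefix + k + ".csv" in dir.
     * Assumption: the join key is the first column of each line.
     * Only the current line and the writers' buffers are held in memory.
     */
    static void partition(Path input, Path dir, String prefix, int n) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[n];
        try {
            // Set up n file writers.
            for (int k = 0; k < n; k++) {
                writers[k] = Files.newBufferedWriter(
                        dir.resolve(prefix + k + ".csv"), StandardCharsets.UTF_8);
            }
            try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.isEmpty()) {
                        continue; // ignore blank lines
                    }
                    int comma = line.indexOf(',');
                    String key = comma < 0 ? line : line.substring(0, comma);
                    // Append the line to the k-th writer.
                    BufferedWriter writer = writers[bucketOf(key, n)];
                    writer.write(line);
                    writer.newLine();
                }
            }
        } finally {
            // Flush/close all n writers.
            for (BufferedWriter writer : writers) {
                if (writer != null) {
                    writer.close();
                }
            }
        }
    }
}
```

Calling partition once per input file with the same n (for example with prefixes "a_" and "b_") puts matching keys of both tables into bucket files with the same index k.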

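Now you have n, hopefully smaller, files, with the guarantee that the same key always ends up in the same file. You can now run your standard HashMap/HashMultiSet-based join on each pair of these files separately.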
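
A per-bucket join might look something like the sketch below. It assumes two-part lines of the form key,rest and loads only one side's bucket into a HashMap; BucketJoiner and joinBucket are hypothetical names, and the output layout key,restOfA,restOfB is just one reasonable choice.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketJoiner {

    /**
     * Inner-joins two bucket files on their first CSV column and appends
     * "key,restOfA,restOfB" lines to out. Only bucketB is held in memory.
     */
    static void joinBucket(Path bucketA, Path bucketB, BufferedWriter out) throws IOException {
        // Build phase: key -> all values seen for that key in bucket B.
        Map<String, List<String>> build = new HashMap<>();
        try (BufferedReader b = Files.newBufferedReader(bucketB, StandardCharsets.UTF_8)) {
            String line;
            while ((line = b.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue; // skip blank or malformed lines
                }
                build.computeIfAbsent(line.substring(0, comma), k -> new ArrayList<>())
                     .add(line.substring(comma + 1));
            }
        }
        // Probe phase: stream bucket A and emit one row per matching B value.
        try (BufferedReader a = Files.newBufferedReader(bucketA, StandardCharsets.UTF_8)) {
            String line;
            while ((line = a.readLine()) != null) {
                int comma = line.indexOf(',');
                if (comma < 0) {
                    continue;
                }
                String key = line.substring(0, comma);
                for (String bValue : build.getOrDefault(key, Collections.emptyList())) {
                    out.write(line + "," + bValue);
                    out.newLine();
                }
            }
        }
    }
}
```

Running joinBucket for k = 0 .. n-1 against the same output writer yields the full join, and the map can be garbage collected between buckets.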

Limitations

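Why did I say hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Think of the worst case, where every line in your files has exactly the same key: you end up with a single bucket file and gain nothing from partitioning.

The same goes for skewed distributions, where a few of your bucket files may be too large to fit into RAM. There are usually three ways out of this dilemma: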

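- Run the algorithm again with a bigger n, so you have more buckets to distribute into.
- Take only the buckets that are too large and run another hash partitioning pass on just those files (so each of them is split again into n newly created buckets).
- Fall back to an external merge sort for the large partition files.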

Sometimes all three are used in different combinations, which is called dynamic partitioning.
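
One detail for the second option: every line in an oversized bucket already has the same value of hashcode modulo n, so repeating exactly the same computation would send all of them into a single sub-bucket again. The second pass needs a different function. The sketch below is one simple possibility, not the only one: it treats the hashcode as a number in base n and uses the next "digit" at each level. LevelHash and the level parameter are invented for the example.

```java
public class LevelHash {

    /**
     * Bucket index for a key at a given partitioning level.
     * Level 0 is the plain hashcode modulo n; each further level first
     * discards the base-n digit that the previous pass already used.
     * Assumption: n >= 2. The digits run out after roughly log_n(2^32) levels.
     */
    static int bucketOf(String key, int n, int level) {
        long h = key.hashCode();
        for (int i = 0; i < level; i++) {
            h = Math.floorDiv(h, (long) n);
        }
        return (int) Math.floorMod(h, (long) n);
    }
}
```

For example, with n = 4 the keys "a", "e", "i" and "m" all land in bucket 1 at level 0, but at level 1 they spread over buckets 0, 1, 2 and 3.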

Answer 2

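If main memory is the constraint for your application but you have access to persistent storage, I would, as suggested by blahfunk, create a temporary SQLite file in your tmp folder, read each file in chunks, and merge them with a simple join. You can create a temporary SQLite database with a library such as Hibernate; just take a look at what I found in this StackOverflow question: How to create database in Hibernate at runtime?

If you cannot do that, your remaining option is to spend more CPU: load only the first row of the first file, search the second file for rows with the same index, buffer the results and flush them to the output file as late as possible, then repeat this for every row of the first file.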
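
The fallback described in the last paragraph is essentially a nested-loop join. A rough sketch, assuming the "index" is the first CSV column (NestedLoopJoin and join are illustrative names), could be:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NestedLoopJoin {

    /**
     * For every row of the first file, rescans the whole second file and
     * writes "key,restOfFirst,restOfSecond" for each row with the same key.
     * Memory use is constant; the BufferedWriter batches output and flushes
     * it when its buffer fills up or when it is closed.
     */
    static void join(Path first, Path second, Path result) throws IOException {
        try (BufferedReader outer = Files.newBufferedReader(first, StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(result, StandardCharsets.UTF_8)) {
            String a;
            while ((a = outer.readLine()) != null) {
                int ca = a.indexOf(',');
                if (ca < 0) {
                    continue; // skip blank or malformed rows
                }
                String key = a.substring(0, ca);
                // Reopen the second file for each outer row.
                try (BufferedReader inner = Files.newBufferedReader(second, StandardCharsets.UTF_8)) {
                    String b;
                    while ((b = inner.readLine()) != null) {
                        int cb = b.indexOf(',');
                        if (cb >= 0 && b.substring(0, cb).equals(key)) {
                            out.write(a + "," + b.substring(cb + 1));
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}
```

This fits in almost any heap, but it reads the second file once per row of the first file, so the running time grows with the product of the two row counts.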

Answer 3

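Maybe you can stream the first file, turn each line into a hashcode, and keep all those hashcodes in memory. Then stream the second file and compute a hashcode for each line as it comes in. If the hashcode is in the first file, i.e. in memory, then don't write the line; otherwise write the line. After that, append the first file in its entirety to the result file.

This would effectively create an index to compare your updates against.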
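
Taken literally, those steps could be sketched as below. The names HashcodeMerge and merge are invented, and the code compares whole lines by String.hashCode(), which is an assumption about what "hashcode" means here. Note that two different lines can share a hashcode, in which case a line from the second file would be dropped by mistake, and that the set grows with the number of distinct hashcodes in the first file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class HashcodeMerge {

    static void merge(Path first, Path second, Path result) throws IOException {
        // Pass 1: remember only the hashcode of every line in the first file.
        Set<Integer> seen = new HashSet<>();
        try (BufferedReader r = Files.newBufferedReader(first, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                seen.add(line.hashCode());
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(result, StandardCharsets.UTF_8)) {
            // Pass 2: write the lines of the second file whose hashcode was not seen.
            try (BufferedReader r = Files.newBufferedReader(second, StandardCharsets.UTF_8)) {
                String line;
                while ((line = r.readLine()) != null) {
                    if (!seen.contains(line.hashCode())) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
            // Pass 3: append the first file in its entirety.
            try (BufferedReader r = Files.newBufferedReader(first, StandardCharsets.UTF_8)) {
                String line;
                while ((line = r.readLine()) != null) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

As written, this produces the lines of both files without repeating lines that appear in both, which is closer to a de-duplicating merge than to a key-based inner join.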

Merge 2 large CSV files using an inner join
