Question

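I need to analyze the differences between two large data files that should have identical structure. Each file is several gigabytes in size, with perhaps 30 million lines of text data. The files are so large that I would rather not load each one into its own array, so it is probably easier to iterate over the lines sequentially. Each line has the structure: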

topicIdx, recordIdx, other fields...

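topicIdx and recordIdx are sequential: they start at zero and increase by 1 with each iteration, so they are easy to locate in the file. (No searching is needed; just advance in order.)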

I need to do something like this:

for each line in fileA
    store line in String itemsA
    get topicIdx and recordIdx
    find line in fileB with same topicIdx and recordIdx
    if exists
        store this line in string itemsB
        for each item in itemsA
            compare value with same index in itemsB
            if these two items are not virtually equal
                //do something
    else
        //do something else

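I wrote the following code with FileReader and BufferedReader, but they do not seem to provide the functionality I need. Can anyone show me how to fix the code below so that it does what I want?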

void checkData(){
    FileReader FileReaderA;
    FileReader FileReaderB;
    int topicIdx = 0;
    int recordIdx = 0;
    try {
        int numLines = 0;
        FileReaderA = new FileReader("B:\\mypath\\fileA.txt");
        FileReaderB = new FileReader("B:\\mypath\\fileB.txt");
        BufferedReader readerA = new BufferedReader(FileReaderA);
        BufferedReader readerB = new BufferedReader(FileReaderB);
        String lineA = null;
        while ((lineA = readerA.readLine()) != null) {
            if (lineA != null && !lineA.isEmpty()) {
                List<String> itemsA = Arrays.asList(lineA.split("\\s*,\\s*"));
                topicIdx = Integer.parseInt(itemsA.get(0));
                recordIdx = Integer.parseInt(itemsA.get(1));
                String lineB = null;
                //lineB = readerB.readLine();//i know this syntax is wrong
                // The next two lines are pseudocode, not valid Java
                setB = rows from FileReaderB where itemsB.get(0).equals(itemsA.get(0));
                for each lineB in setB{
                    List<String> itemsB = Arrays.asList(lineB.split("\\s*,\\s*"));
                    for(int m = 0;m<itemsB.size();m++){}
                    for(int j=0;j<itemsA.size();j++){
                        double myDblA = Double.parseDouble(itemsA.get(j));
                        double myDblB = Double.parseDouble(itemsB.get(j));
                        if(Math.abs(myDblA-myDblB)>0.0001){
                            //do something
                        }
                    }
                }
            }
        }
        readerA.close();
    } catch (IOException e) {e.printStackTrace();}
}

Answer 1

If you really need to use Java, why not use java-diff-utils? It implements a well-known diff algorithm.

Answer 2

You need both files sorted by the search key (recordIdx and topicIdx), so that you can do a merge-like operation:

open file 1
open file 2
read lineA from file1
read lineB from file2
while (there is lineA and lineB) 
    if (key lineB < key lineA) 
        read lineB from file 2
        continue loop
    if (key lineB > key lineA)
        read lineA from file 1
        continue
    // at this point, you have lineA and lineB with matching keys
    process your data
    read lineB from file 2

Note that only two records are held in memory at any time.
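For what it's worth, here is a minimal Java sketch of the merge loop above. It is only an illustration under some assumptions that are not stated in the question: keys are unique within each file (so both readers advance on a match, unlike the pseudocode, which advances only file 2), both files are already ordered by topicIdx and then recordIdx, and every field after the two keys parses as a double. The class name `MergeCompare`, the helper names, and the report strings are all made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;

public class MergeCompare {

    /** Returns the next non-empty line split into trimmed fields, or null at end of input. */
    static String[] nextRecord(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                return line.trim().split("\\s*,\\s*");
            }
        }
        return null;
    }

    /** Orders records by topicIdx (field 0), then recordIdx (field 1). */
    static int compareKeys(String[] a, String[] b) {
        int byTopic = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
        return byTopic != 0 ? byTopic
                : Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
    }

    static String key(String[] rec) {
        return rec[0] + "," + rec[1];
    }

    /** Walks both sorted inputs in step; only one record per file is held at a time. */
    static void compare(BufferedReader readerA, BufferedReader readerB,
                        double tolerance, Consumer<String> sink) throws IOException {
        String[] a = nextRecord(readerA);
        String[] b = nextRecord(readerB);
        while (a != null && b != null) {
            int order = compareKeys(a, b);
            if (order < 0) {            // key exists only in file A
                sink.accept("only in A: " + key(a));
                a = nextRecord(readerA);
            } else if (order > 0) {     // key exists only in file B
                sink.accept("only in B: " + key(b));
                b = nextRecord(readerB);
            } else {                    // matching keys: compare the remaining fields
                if (a.length != b.length) {
                    sink.accept("field count differs: " + key(a));
                }
                int n = Math.min(a.length, b.length);
                for (int j = 2; j < n; j++) {
                    double da = Double.parseDouble(a[j]);
                    double db = Double.parseDouble(b[j]);
                    if (Math.abs(da - db) > tolerance) {
                        sink.accept("differs: " + key(a) + " field " + j);
                    }
                }
                // Assumption: unique keys, so advance both files.
                a = nextRecord(readerA);
                b = nextRecord(readerB);
            }
        }
        // Whatever is left in either file has no partner in the other one.
        for (; a != null; a = nextRecord(readerA)) sink.accept("only in A: " + key(a));
        for (; b != null; b = nextRecord(readerB)) sink.accept("only in B: " + key(b));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java MergeCompare <fileA> <fileB>");
            return;
        }
        try (BufferedReader readerA = Files.newBufferedReader(Paths.get(args[0]));
             BufferedReader readerB = Files.newBufferedReader(Paths.get(args[1]))) {
            compare(readerA, readerB, 0.0001, System.out::println);
        }
    }
}
```

Findings are passed to a `Consumer<String>` rather than collected in a list, so memory use stays flat however many differences turn up. The 0.0001 tolerance is the one from the question's code.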

Answer 3

Consider https://code.google.com/p/java-diff-utils/. Let someone else do the heavy lifting.

Comparing data line by line from two large files
