Comparing large text files line by line in Java

Time: 2018-02-14 10:59:03

Tags: java string file

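Dear developers, I am writing a Java program that compares two text files line by line. The first file has 99,000 lines and the other has 115,000 lines. I want to read both files and compare them so that any line appearing in both the first and the second file is printed as a match. I have written the code, but because of the for loops it takes about 10 minutes to finish printing. How can I make it fast, efficient, and memory-friendly? Please guide me. Thanks.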

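import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Main {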

static final String file1 = "file1.txt";
static final String file2 = "file2.txt";

static BufferedReader b1 = null;
static BufferedReader b2 = null;

static List<String> list_file1 = null;
static List<String> list_file2 = null;

public static void main(String[] args) {

    list_file1 = new ArrayList<String>();
    list_file2 = new ArrayList<String>();

    String lineText = null;

    try {
        b1 = new BufferedReader(new FileReader(file1));
        while ((lineText = b1.readLine()) != null) {
            list_file1.add(lineText);
        }
        b2 = new BufferedReader(new FileReader(file2));
        while ((lineText = b2.readLine()) != null) {
            list_file2.add(lineText);
        }
        compareFile(list_file1,list_file2);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

}

private static void compareFile(List<String> list_file1, List<String> list_file2) {
    // Nested loops: every line of file 1 is compared with every line of file 2
    for(String content1:list_file1){
        for(String content2:list_file2){
            if(content1.equals(content2)){
                System.out.println("Match Found:-"+content1);
            }
        }
    }
}
} 

3 Answers:

Answer 0 (score: 0)

Use a HashSet and its contains method:

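import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Main {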

    static final String file1 = "/tmp/file1";
    static final String file2 = "/tmp/file2";
    static BufferedReader b1 = null;
    static BufferedReader b2 = null;
    static Set<String> list_file1 = null;

    public static void main(String[] args) {
        list_file1 = new HashSet<>();
        String lineText = null;
        try {
            b1 = new BufferedReader(new FileReader(file1));
            while ((lineText = b1.readLine()) != null) {
                list_file1.add(lineText);
            }
            b2 = new BufferedReader(new FileReader(file2));
            while ((lineText = b2.readLine()) != null) {
                if (list_file1.contains(lineText)) {
                    System.out.println("Match Found:-" + lineText);
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

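If the files contain duplicate lines, you can use a HashMap instead. It counts how often each line occurs in file 1, so each match is printed as many times as the nested loops would print it: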

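import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

public class Main {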

    static final String file1 = "/tmp/file1";
    static final String file2 = "/tmp/file2";
    static BufferedReader b1 = null;
    static BufferedReader b2 = null;
    static HashMap<String, Integer> list_file1 = null;

    public static void main(String[] args) {
        list_file1 = new HashMap<>();
        String lineText = null;
        try {
            b1 = new BufferedReader(new FileReader(file1));
            while ((lineText = b1.readLine()) != null) {
                if (!list_file1.containsKey(lineText))
                    list_file1.put(lineText, 1);
                else
                    list_file1.put(lineText, list_file1.get(lineText) + 1);
            }
            b2 = new BufferedReader(new FileReader(file2));
            while ((lineText = b2.readLine()) != null) {
                if (list_file1.containsKey(lineText)) {
                    for (int i = 0; i < list_file1.get(lineText); i++) {
                        System.out.println("Match Found:-" + lineText);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
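
To illustrate why the counts matter, here is a minimal sketch that works on in-memory lists rather than files. The class and method names are hypothetical and not part of the answer above. It compares the number of matches reported by the question's nested loops with the number obtained from a count map: a line occurring a times in the first list and b times in the second is reported a × b times by both.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MatchCountSketch {

    // Reference behaviour: the question's nested loops, counting matches instead of printing them.
    static long countWithNestedLoops(List<String> first, List<String> second) {
        long matches = 0;
        for (String a : first) {
            for (String b : second) {
                if (a.equals(b)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    // Same result with one pass over each list: count the lines of the
    // first list, then look up each line of the second list in the map.
    static long countWithMap(List<String> first, List<String> second) {
        Map<String, Integer> counts = new HashMap<>();
        for (String a : first) {
            counts.merge(a, 1, Integer::sum);
        }
        long matches = 0;
        for (String b : second) {
            matches += counts.getOrDefault(b, 0);
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("x", "x", "y");
        List<String> second = Arrays.asList("x", "z", "x", "x");
        System.out.println(countWithNestedLoops(first, second)); // 6
        System.out.println(countWithMap(first, second));         // 6
    }
}
```

A plain HashSet would report only 3 matches for this input (one per occurrence of "x" in the second list), which is why the map of counts is needed when file 1 can contain repeated lines.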

Answer 1 (score: 0)

Try the following. I removed one of the explicit loops from your code; duplicates are still reported. Here I use a Java 8 stream.

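import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) {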
  final String file1 = "file1.txt";
  final String file2 = "file2.txt";
  BufferedReader b1;
  BufferedReader b2;

  List<String> list_file1 = new ArrayList<>();

  String lineText;

  try {
    b1 = new BufferedReader(new FileReader(file1));
    while ((lineText = b1.readLine()) != null) {
      list_file1.add(lineText);
    }
    b2 = new BufferedReader(new FileReader(file2));
    while ((lineText = b2.readLine()) != null) {
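      final String text = lineText;
      // Note: equalsIgnoreCase ignores case, unlike equals in the question,
      // and the stream still scans the whole list for every line of file 2.
      list_file1.stream().filter(s -> s.equalsIgnoreCase(text)).forEach(s -> System.out.println("Match Found:-" + text));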
    }
  }
  catch (IOException e) {
    e.printStackTrace();
  }
} 

Answer 2 (score: 0)

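The following prints the lines in the same order as your own code. That is, if file 1 contains the lines "one", "two" and file 2 contains the lines "two", "one" in that order, the output will be "one", "two". To achieve this, we first read file 2 and build a map from each line to its number of occurrences: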

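import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

static void printDuplicateLines(String filename1, String filename2) throws IOException {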

    // Index the lines of file 2 with a map of line -> count
    Map<String, Integer> linesOfFile2 = new HashMap<>();
    try (Stream<String> lines = Files.lines(Paths.get(filename2))) {
        lines.forEach(line -> linesOfFile2.merge(line, 1, (oldValue, x) -> oldValue + 1));
    }

    // Check file 1 to see which lines are duplicate
    try (Stream<String> lines = Files.lines(Paths.get(filename1))) {
        lines.forEach(line -> {
                    int countOccurrencesInFile2 = linesOfFile2.getOrDefault(line, 0);
                    for (int i = 1; i <= countOccurrencesInFile2; i++)
                        System.out.println("Match Found:-" + line);
                }
        );
    }
}

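We then read file 1 line by line, look up how many times that line occurs in file 2 (0 if it does not occur), and print the line that many times.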

Note the use of try-with-resources to ensure that the files are closed properly.
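
The same try-with-resources pattern also works with BufferedReader, as used in the question. The following is a minimal sketch, not the answer's own code: the class and method names are hypothetical, UTF-8 encoding is an assumption, and it uses a HashSet, so a matching line is reported once per occurrence in file 2 regardless of how often it occurs in file 1.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TryWithResourcesSketch {

    // Returns every line of file2 that also occurs in file1, in file2 order.
    static List<String> matchingLines(Path file1, Path file2) throws IOException {
        Set<String> seen = new HashSet<>();
        // The reader is closed automatically when the block exits,
        // whether it completes normally or throws an exception.
        try (BufferedReader reader = Files.newBufferedReader(file1, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                seen.add(line);
            }
        }
        List<String> matches = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file2, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (seen.contains(line)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the question; adjust as needed.
        for (String line : matchingLines(Paths.get("file1.txt"), Paths.get("file2.txt"))) {
            System.out.println("Match Found:-" + line);
        }
    }
}
```

Returning the matches as a list instead of printing inside the loop keeps the comparison separate from the output, which makes the method easier to test.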