Comparing two large text files in Java

Date: 2014-07-10 16:06:48

Tags: java

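I read two files (one large and one small).

The first file contains both upper-case and lower-case letters, but the second file contains only upper-case letters.

The program first extracts the upper-case letters from the first (large) file and then compares them with the second file (which contains only upper-case letters).

My code works fine when the large file is small, but when the file is about 400 MB the program fails with a "Java Out of Memory Error".

Here is my code: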

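import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Scanner;

public class SequenceComparator {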

private ArrayList<Sequence> bigSequences;
private ArrayList<Sequence> smallSequences;

public SequenceComparator() {
    bigSequences = new ArrayList<Sequence>();
    smallSequences = new ArrayList<Sequence>();
}

private String splitUpperSequences(String bigSeq) {

    StringBuilder sb = new StringBuilder();
    for (char c : bigSeq.toCharArray()) {
        if (Character.isLetter(c) && Character.isUpperCase(c)) {
            sb.append(c);
        }
    }
    return sb.toString();
}

public void readBigSequences() throws FileNotFoundException {
    Scanner s = new Scanner(new FileReader("test_ref_Aviso_bristol_k_31_c_4.fa"));
    while (s.hasNextLine()) {
        String title = s.nextLine();
        if (!title.startsWith(">")) {
            continue;
        }
        String seq = s.nextLine();
        Sequence sequence = new Sequence(title, splitUpperSequences(seq).trim());
        bigSequences.add(sequence);
    }
    s.close();

}

public void readSmallSequences() throws FileNotFoundException {
    Scanner s = new Scanner(new FileReader("SNP12K.fasta"));
    while (s.hasNextLine()) {
        String title = s.nextLine().trim();
        if (!title.startsWith(">")) {
            continue;
        }
        String seq = s.nextLine().trim();
        Sequence sequence = new Sequence(title, seq);
        smallSequences.add(sequence);
    }
    s.close();

}

public void printSeqArray(ArrayList<Sequence> seqArray) {
    for (Sequence sequence : seqArray) {
        System.out.println(sequence);
    }
}

private void reportNotFoundSeqs(ArrayList<Sequence> notFoundSeqs) {
    System.out.println("Sequence that is not similar with big file:\n\n");
    printSeqArray(notFoundSeqs);
}

public void comparison() {
    int bigLength = bigSequences.size();
    int smallLength = smallSequences.size();
    System.out.println("Sequences Length of big file is " + bigLength);
    System.out.println("Sequences Length of small file is " + smallLength);
    System.out.println("\n");
    if (bigLength > smallLength) {
        System.out.println("big file has " + (bigLength - smallLength) + " sequence more than small file");
    } else {
        System.out.println("small file has " + (smallLength - bigLength) + " sequence more than big file");
    }
    System.out.println("\n");
    int s = 0;
    ArrayList<Sequence> notFoundSeqs = new ArrayList<Sequence>();
    for (Sequence smalSeq : smallSequences) {
        if (bigSequences.contains(smalSeq)) {
            s++;
        } else {
            notFoundSeqs.add(smalSeq);
        }
    }
    System.out.println("Two files are similar in " + s + " point");
    System.out.println("\n");
    reportNotFoundSeqs(notFoundSeqs);

}

public ArrayList<Sequence> getBigSequences() {
    return bigSequences;
}

public ArrayList<Sequence> getSmallSequences() {
    return smallSequences;
}

public static void main(String[] args) throws FileNotFoundException {
    SequenceComparator sc = new SequenceComparator();

    System.out.println("Reading files...");
    long before = System.currentTimeMillis();
    sc.readBigSequences();
    System.out.println("\nBig file upper sequences:\n");
    sc.printSeqArray(sc.getBigSequences());

    sc.readSmallSequences();

    sc.comparison();
    long after = System.currentTimeMillis();
    System.out.println("Time: " + ((after - before) / 1000) + " Seconds");
}

class Sequence {

    private String title;
    private String seq;

    public Sequence(String title, String seq) {
        this.seq = seq;
        this.title = title;
    }

    public Sequence() {
    }

    public String getSeq() {
        return seq;
    }

    public void setSeq(String seq) {
        this.seq = seq;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    @Override
    public String toString() {
        return "\nTitle: " + title + "\n" + "Sequence: " + seq + "\n";
    }

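    @Override
    public boolean equals(Object obj) {
        // Reject null and foreign types before casting.
        if (!(obj instanceof Sequence)) {
            return false;
        }
        Sequence other = (Sequence) obj;
        return java.util.Objects.equals(seq, other.seq);
    }

    // Keep hashCode consistent with equals: both depend only on seq.
    @Override
    public int hashCode() {
        return java.util.Objects.hashCode(seq);
    }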

}
}

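What should I do?

2 answers:

Answer 0 (score: 0)

You are loading the entire contents of the files into memory, which is why you get the out-of-memory error. Instead, try creating a temporary file in the same format and compare the two files line by line. Delete the temporary file at the end.

Example: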

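// Requires: java.io.BufferedReader, java.io.BufferedWriter, java.io.File,
// java.io.FileNotFoundException, java.io.IOException,
// java.nio.file.Files, java.nio.file.StandardOpenOption
private File prepareFile(String rawFirstFile) throws FileNotFoundException, IOException {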

    File tempFile = new File("rawFirstFile_temp.dat");

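    try(BufferedReader br = Files.newBufferedReader(new File(rawFirstFile).toPath());
        // CREATE + TRUNCATE_EXISTING: WRITE alone fails if the temp file does not exist yet
        BufferedWriter wr = Files.newBufferedWriter(tempFile.toPath(),
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)){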

        String line = null;

        while ((line = br.readLine()) != null) {
            //
            // change the raw line, save filtered data as new line in your temp file
            // what ever you want. example:
            //      title1,seq1
            //      title2,seq2
            //      ...
            //
            wr.write(changeLine(line));
            wr.newLine();
        }
        return tempFile;
    }
}

public void compareFiles(String firstFile, String secondFilePath) throws FileNotFoundException, IOException{

    File tempFirstFile  = prepareFile(firstFile);
    File secondFile     = new File(secondFilePath); // may need to be prepared too

    try(BufferedReader  br1 = Files.newBufferedReader(tempFirstFile.toPath());
        BufferedReader  br2 = Files.newBufferedReader(secondFile.toPath())){

        String line1File = null;
        String line2File = null;

        // line by line
        while ((line1File = br1.readLine()) != null && (line2File = br2.readLine()) != null ) {
            //
            // compare them
            //
        }
    } finally {
        if(tempFirstFile != null){
            // Remove the temp file right away; deleteIfExists does not throw if it is already gone.
            Files.deleteIfExists(tempFirstFile.toPath());
        }
    }

}

private String changeLine(String rawLine) {
    //TODO
    return rawLine;
}

Edit: changed the try-catch to a try-with-resources statement to make the answer cleaner.
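
Since only the large file is a problem for memory, another option is to keep just the small file's sequences in a set and stream the large file past it. The following is a minimal sketch of that idea, not the asker's or the answerer's code. It assumes that each sequence sits on a single line, that header lines start with ">", and that the files are plain ASCII. The class name `StreamingSequenceCompare` and its method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only.
class StreamingSequenceCompare {

    // Keeps only the upper-case letters of a line, like splitUpperSequences in the question.
    static String upperOnly(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c) && Character.isUpperCase(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Loads every non-header, non-empty line of the small file into a set.
    static Set<String> readSmallSequences(Path smallFile) throws IOException {
        Set<String> sequences = new HashSet<>();
        // ISO_8859_1 decodes any byte, so unexpected bytes do not abort the read (assumes ASCII data).
        try (BufferedReader br = Files.newBufferedReader(smallFile, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith(">")) {
                    sequences.add(line);
                }
            }
        }
        return sequences;
    }

    // Streams the big file one line at a time and returns the small sequences never seen in it.
    static Set<String> findMissing(Path bigFile, Set<String> smallSequences) throws IOException {
        Set<String> notFound = new HashSet<>(smallSequences);
        try (BufferedReader br = Files.newBufferedReader(bigFile, StandardCharsets.ISO_8859_1)) {
            String line;
            // Stop early once every small sequence has been matched.
            while (!notFound.isEmpty() && (line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    continue; // header line, not a sequence
                }
                notFound.remove(upperOnly(line));
            }
        }
        return notFound;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: StreamingSequenceCompare <smallFile> <bigFile>");
            return;
        }
        Set<String> small = readSmallSequences(Paths.get(args[0]));
        Set<String> missing = findMissing(Paths.get(args[1]), small);
        System.out.println("Found: " + (small.size() - missing.size()));
        System.out.println("Not found: " + missing);
    }
}
```

With this shape, memory use depends on the small file rather than the 400 MB one, and each lookup is a hash probe instead of a scan through a list. Note that the set drops duplicate sequences and titles; a `Map` from sequence to title would be needed to report titles as the original program does.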

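Answer 1 (score: 0)

You should compare byte by byte rather than line by line as dit suggested. If the origin of the files is unknown, using readLine() is vulnerable to attack: a file that contains no line terminators forces readLine() to buffer its entire content as a single line.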
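
A byte-by-byte comparison could be sketched as follows. This is only an illustration under assumptions: the class name `ByteCompare` and the method `firstMismatch` are hypothetical, and the code checks whether two files are identical. It does not perform the upper-case filtering described in the question, so the large file would still need to be filtered first.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class for illustration only.
class ByteCompare {

    // Returns the offset of the first differing byte, or -1 if both files are identical.
    static long firstMismatch(Path a, Path b) throws IOException {
        // Buffered streams keep memory use constant no matter how large the files are.
        try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
             InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
            long position = 0;
            while (true) {
                int b1 = in1.read();
                int b2 = in2.read();
                if (b1 != b2) {
                    return position; // also covers one file ending before the other
                }
                if (b1 == -1) {
                    return -1; // both streams ended at the same offset
                }
                position++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: ByteCompare <fileA> <fileB>");
            return;
        }
        long offset = firstMismatch(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(offset == -1 ? "Files are identical" : "First difference at byte " + offset);
    }
}
```

On Java 12 and later, `java.nio.file.Files.mismatch(Path, Path)` provides the same kind of check in the standard library.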