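I am reading two files (a big one and a small one).
The first file contains both upper-case and lower-case letters, but the second file contains only upper-case letters.
The program first extracts the upper-case letters from the first (big) file and then compares them with the second file (upper-case only).
My code works well when the big file is small, but when the file is about 400 MB the program fails with the internal error "Java Out of Memory Error".
Here is my code: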
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Scanner;

public class SequenceComparator {
    private ArrayList<Sequence> bigSequences;
    private ArrayList<Sequence> smallSequences;

    public SequenceComparator() {
        bigSequences = new ArrayList<Sequence>();
        smallSequences = new ArrayList<Sequence>();
    }

    // Keeps only the upper-case letters of a sequence line.
    private String splitUpperSequences(String bigSeq) {
        StringBuilder sb = new StringBuilder();
        for (char c : bigSeq.toCharArray()) {
            if (Character.isLetter(c) && Character.isUpperCase(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reads every title/sequence pair of the big file into memory.
    public void readBigSequences() throws FileNotFoundException {
        Scanner s = new Scanner(new FileReader("test_ref_Aviso_bristol_k_31_c_4.fa"));
        while (s.hasNextLine()) {
            String title = s.nextLine();
            if (!title.startsWith(">")) {
                continue;
            }
            if (!s.hasNextLine()) {
                break; // title line without a sequence line at end of file
            }
            String seq = s.nextLine();
            Sequence sequence = new Sequence(title, splitUpperSequences(seq).trim());
            bigSequences.add(sequence);
        }
        s.close();
    }

    // Reads every title/sequence pair of the small file into memory.
    public void readSmallSequences() throws FileNotFoundException {
        Scanner s = new Scanner(new FileReader("SNP12K.fasta"));
        while (s.hasNextLine()) {
            String title = s.nextLine().trim();
            if (!title.startsWith(">")) {
                continue;
            }
            if (!s.hasNextLine()) {
                break; // title line without a sequence line at end of file
            }
            String seq = s.nextLine().trim();
            Sequence sequence = new Sequence(title, seq);
            smallSequences.add(sequence);
        }
        s.close();
    }

    public void printSeqArray(ArrayList<Sequence> seqArray) {
        for (Sequence sequence : seqArray) {
            System.out.println(sequence);
        }
    }

    private void reportNotFoundSeqs(ArrayList<Sequence> notFoundSeqs) {
        System.out.println("Sequences not found in the big file:\n\n");
        printSeqArray(notFoundSeqs);
    }

    public void comparison() {
        int bigLength = bigSequences.size();
        int smallLength = smallSequences.size();
        System.out.println("Number of sequences in big file is " + bigLength);
        System.out.println("Number of sequences in small file is " + smallLength);
        System.out.println("\n");
        if (bigLength > smallLength) {
            System.out.println("big file has " + (bigLength - smallLength) + " more sequences than small file");
        } else {
            System.out.println("small file has " + (smallLength - bigLength) + " more sequences than big file");
        }
        System.out.println("\n");
        int s = 0;
        ArrayList<Sequence> notFoundSeqs = new ArrayList<Sequence>();
        // contains() scans the whole big list for every small sequence
        for (Sequence smalSeq : smallSequences) {
            if (bigSequences.contains(smalSeq)) {
                s++;
            } else {
                notFoundSeqs.add(smalSeq);
            }
        }
        System.out.println("Two files are similar in " + s + " points");
        System.out.println("\n");
        reportNotFoundSeqs(notFoundSeqs);
    }

    public ArrayList<Sequence> getBigSequences() {
        return bigSequences;
    }

    public ArrayList<Sequence> getSmallSequences() {
        return smallSequences;
    }

    public static void main(String[] args) throws FileNotFoundException {
        SequenceComparator sc = new SequenceComparator();
        System.out.println("Reading files...");
        long befor = System.currentTimeMillis();
        sc.readBigSequences();
        System.out.println("\nBig file upper sequences:\n");
        sc.printSeqArray(sc.getBigSequences());
        sc.readSmallSequences();
        sc.comparison();
        long afer = System.currentTimeMillis();
        System.out.println("Time: " + ((afer - befor) / 1000) + " Seconds");
    }

    class Sequence {
        private String title;
        private String seq;

        public Sequence(String title, String seq) {
            this.seq = seq;
            this.title = title;
        }

        public Sequence() {
        }

        public String getSeq() {
            return seq;
        }

        public void setSeq(String seq) {
            this.seq = seq;
        }

        public String getTitle() {
            return title;
        }

        public void setTitle(String title) {
            this.title = title;
        }

        @Override
        public String toString() {
            return "\nTitle: " + title + "\n" + "Sequence: " + seq + "\n";
        }

        // Two sequences are equal when their seq strings match; titles are ignored.
        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof Sequence)) {
                return false;
            }
            Sequence other = (Sequence) obj;
            return seq == null ? other.seq == null : seq.equals(other.seq);
        }

        // Kept consistent with equals().
        @Override
        public int hashCode() {
            return seq == null ? 0 : seq.hashCode();
        }
    }
}
What should I do?
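Answer 0 (score: 0)
You are loading the entire contents of the files into memory, which is why you get the out-of-memory error. Try creating a temporary file in the same format as the second file and comparing the two files line by line. Delete the temporary file at the end.
Example: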
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

private File prepareFile(String rawFirstFile) throws IOException {
    File tempFile = new File("rawFirstFile_temp.dat");
    // With no open options, newBufferedWriter creates the file or truncates an existing one.
    // Passing only StandardOpenOption.WRITE would fail if the file does not exist yet.
    try (BufferedReader br = Files.newBufferedReader(new File(rawFirstFile).toPath());
         BufferedWriter wr = Files.newBufferedWriter(tempFile.toPath())) {
        String line;
        while ((line = br.readLine()) != null) {
            //
            // change the raw line and save the filtered data as a new line
            // in your temp file, in whatever format you want. Example:
            // title1,seq1
            // title2,seq2
            // ...
            //
            wr.write(changeLine(line));
            wr.newLine();
        }
    }
    return tempFile;
}

public void compareFiles(String firstFile, String secondFileName) throws IOException {
    File tempFirstFile = prepareFile(firstFile);
    File secondFile = new File(secondFileName); // may need to be prepared too
    try (BufferedReader br1 = Files.newBufferedReader(tempFirstFile.toPath());
         BufferedReader br2 = Files.newBufferedReader(secondFile.toPath())) {
        String line1File;
        String line2File;
        // line by line, until either file runs out of lines
        while ((line1File = br1.readLine()) != null && (line2File = br2.readLine()) != null) {
            //
            // compare them
            //
        }
    } finally {
        // remove the temp file once the comparison is done
        Files.deleteIfExists(tempFirstFile.toPath());
    }
}

private String changeLine(String rawLine) {
    // TODO: filter or reformat the raw line here
    return rawLine;
}
Edit: changed the try-catch to a try-with-resources statement to make the answer cleaner.
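A variant of the same idea, sketched here without a temporary file: keep only the small file's sequences in memory and stream the big file once. This is not the code from the answer above. The class name `StreamingSequenceCheck` and its methods are hypothetical, and the sketch assumes each sequence sits on a single line and that the small file fits in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the original question or answer.
final class StreamingSequenceCheck {

    // Keeps only the upper-case characters of a line, like splitUpperSequences in the question.
    static String upperOnly(String line) {
        StringBuilder sb = new StringBuilder(line.length());
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Loads the sequences of the small file into a set; title lines (">...") are skipped.
    // Assumption: one sequence per line.
    static Set<String> loadSmall(Path smallFile) throws IOException {
        Set<String> seqs = new HashSet<>();
        try (BufferedReader br = Files.newBufferedReader(smallFile)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty() && !line.startsWith(">")) {
                    seqs.add(line);
                }
            }
        }
        return seqs;
    }

    // Streams the big file line by line and returns the small-file sequences
    // that never appeared in it. Only one big-file line is held at a time.
    static Set<String> notFound(Path bigFile, Path smallFile) throws IOException {
        Set<String> remaining = loadSmall(smallFile);
        try (BufferedReader br = Files.newBufferedReader(bigFile)) {
            String line;
            while (!remaining.isEmpty() && (line = br.readLine()) != null) {
                if (line.startsWith(">")) {
                    continue; // title line
                }
                remaining.remove(upperOnly(line));
            }
        }
        return remaining;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java StreamingSequenceCheck.java <bigFile> <smallFile>
        Set<String> missing = notFound(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Sequences not found in the big file: " + missing.size());
        for (String seq : missing) {
            System.out.println(seq);
        }
    }
}
```

Memory use then depends on the size of the small file rather than on the 400 MB one, and the set lookup replaces the repeated `ArrayList.contains` scan.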
Answer 1 (score: 0)
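You should compare byte by byte rather than line by line as dit suggested. If the source of the files is unknown, using readLine() is open to abuse: a file with no line terminators forces it to buffer one arbitrarily long line in memory.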
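A minimal sketch of what a byte-by-byte comparison could look like, assuming both files are already in the same format (for example after the filtering step from the first answer). The class name `ByteCompare` and the method `firstMismatch` are hypothetical and used only for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original answers.
final class ByteCompare {

    // Returns the offset of the first differing byte, or -1 if the files are identical.
    // If one file is a prefix of the other, the offset is the length of the shorter file.
    static long firstMismatch(Path a, Path b) throws IOException {
        try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
             InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
            long pos = 0;
            while (true) {
                int x = in1.read();
                int y = in2.read();
                if (x != y) {
                    return pos; // bytes differ, or one stream ended first
                }
                if (x == -1) {
                    return -1;  // both streams ended together
                }
                pos++;
            }
        }
    }
}
```

The buffered streams hold a fixed-size buffer each, so memory use does not depend on line lengths or file size.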