Sudden slowdown and java.lang.OutOfMemoryError during Java string search

Time: 2012-08-17 03:14:16

Tags: java string search memory java.util.scanner

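I am writing a program mainly for pattern discovery in RNA sequences. To find "patterns" in the sequences, I generate a set of candidate patterns and scan the input file of all sequences for each one (there is more to the algorithm, but this is the part that breaks). The candidate patterns have a length specified by the user.

This works for all pattern lengths up to 8 characters. At length 9, the program runs for a long time and then throws java.lang.OutOfMemoryError. After some debugging, I found that the weak point is the pattern generation method: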

/* Get elementary pattern (ep) substrings, to later combine into full patterns */
public static void init_ep_subs(int length) {

ep_subs = new ArrayList<Substring>(); // clear static ep_subs data field

/* ep subs are of the form C1...C2...C3 where C1, C2, C3 are characters in the
   alphabet and the whole length of the string is equal to the input parameter
   'length'. The number of dots varies for different lengths.
The middle character C2 can occur instead of any dot, or not at all.*/

for (int i = 1; i < length-1; i++) { // for each potential position of C2

    // for each alphabet character to be C1
    for (int first = 0; first < alphabet.length; first++) { 

    // for each alphabet character to be C3
    for (int last = 0; last < alphabet.length; last++) {

        // make blank pattern, i.e. no C2
        Substring s_blank = new Substring(-1, alphabet[first],
                          '0', alphabet[last]);

        // get its frequency in the input string
        s_blank.occurrences = search_sequences(s_blank.toString());

        // if blank ep is found frequently enough in the input string, store it
        if (s_blank.frequency()>=nP) ep_subs.add(s_blank);

        // when C2 is present, for each character it could be
        for (int mid = 0; mid < alphabet.length; mid++) {

        // make pattern C1,C2,C3
        Substring s = new Substring(i, alphabet[first],
                        alphabet[mid],
                        alphabet[last]);

        // search input string for pattern s
        s.occurrences = search_sequences(s.toString());

        // if s is frequent enough, store it
        if (s.frequency()>=nP) ep_subs.add(s);
        }
    }
    }
}
}

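Here is what happens: when I time the calls to search_sequences, they start out at around 40-100 ms each and stay that way for the first patterns. Then, after a few hundred patterns (around 'C.....G.C'), the calls suddenly start taking about ten times as long, 1000-2000 ms. After that the time increases steadily until, at around 12000 ms ('C......TA'), it gives this error: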

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3209)
    at java.lang.String.<init>(String.java:215)
    at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
    at java.nio.CharBuffer.toString(CharBuffer.java:1157)
    at java.util.regex.Matcher.toMatchResult(Matcher.java:232)
    at java.util.Scanner.match(Scanner.java:1270)
    at java.util.Scanner.hasNextLine(Scanner.java:1478)
    at PatternFinder4.search_sequences(PatternFinder4.java:217)
    at PatternFinder4.init_ep_subs(PatternFinder4.java:256)
    at PatternFinder4.main(PatternFinder4.java:62)

Here is the search_sequences method:

/* Searches the input string 'sequences' for occurrences of the parameter string 'sub' */
public static ArrayList<int[]> search_sequences(String sub) {

/* arraylist returned holding int arrays with coordinates of the places where 'sub'
 was found, i.e. {l,i} l = line number, i = index within line */
ArrayList<int[]> occurrences = new ArrayList<int[]>();
s = new Scanner(sequences);
int line_index = 0;

String line = "";
while (s.hasNextLine()) {
    line = s.nextLine();
    pattern = Pattern.compile(sub);
    matcher = pattern.matcher(line);
    pattern = null; // all the =nulls were intended to help memory management, had no effect

    int index = 0;

    // for each occurrence of 'sub' in the line being scanned
    while (matcher.find(index)) {
    int start = matcher.start(); // get the index of the next occurrence
    int[] occurrence = {line_index, start}; // make up the coordinate array
    occurrences.add(occurrence); // store that occurrence
    index = start+1; // start looking from after the last occurrence found
    }
    matcher=null;
    line=null;
    line_index++;

}
s=null;

return occurrences;
}

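I have tried this program on several computers of different speeds. The actual time search_sequences takes is shorter on the faster machines, but the relative times are the same: at roughly the same number of iterations, search_sequences starts taking ten times as long to complete.

I have tried googling the memory efficiency and speed of different input streams such as BufferedReader, but the general consensus seems to be that they are roughly equivalent to Scanner. Do any of you have suggestions about what this bug is, or how I could try to work it out myself?

If anyone wants to see more code, just ask.

EDIT:

1 - The input file 'sequences' is 1000 protein sequences (each on one line), each around a few hundred characters long. I should also mention that this program will /only ever need to work/ for patterns of up to length 9.

2 - Here are the Substring class methods used in the code above: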

static class Substring {
int residue; // position of the middle character C2
char front, mid, end; // alphabet characters for C1, C2 and C3
ArrayList<int[]> occurrences; // list of positions the substring occurs in 'sequences'
String string; // string representation of the substring

public Substring(int inresidue, char infront, char inmid, char inend) {
    occurrences = new ArrayList<int[]>();
    residue = inresidue;
    front = infront;
    mid = inmid;
    end = inend;
    setString(); // makes the string representation using characters and their positions
}

/* gets the frequency of the substring given the places it occurs in 'sequences'. 
   This only counts the substring /once per line it occurs in/. */
public int frequency() {
    return PatternFinder.frequency(occurrences);
}

public String toString() {
    return string;
}

/* makes the string representation using the substring's characters and their positions */
private void setString() {
    if (residue>-1) {
    String left_mid = "";
    for (int j = 0; j < residue-1; j++) left_mid += ".";
    String right_mid = "";
    for (int j = residue+1; j < length-1; j++) right_mid += ".";
    string = front + left_mid + mid + right_mid + end;
    } else {
    String mid = "";
    for (int i = 0; i < length-2; i++) mid += ".";
    string = front + mid + end;
    }
}
 }

...and the PatternFinder.frequency method (called in Substring.frequency()):

public static int frequency(ArrayList<int[]> occurrences) {
    HashSet<String> lines_present = new HashSet<String>();
    for (int[] occurrence : occurrences) {
        lines_present.add(new String(occurrence[0]+""));
    }
    return lines_present.size();
    }

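2 Answers:

Answer 0 (score: 0)

What is alphabet? What kind of regexes are you giving it? Have you checked how many occurrences you are storing? Since you are doing an exponential number of searches, simply storing the occurrences may be enough to make it run out of memory.

It sounds like your algorithm has hidden exponential resource usage. You need to rethink what you are trying to do.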
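
One way to check the scale is to count how often search_sequences gets called. The following is a minimal sketch that assumes the loop structure of init_ep_subs as posted in the question; the class name SearchCount and the alphabet sizes 4 and 20 are hypothetical, since the question does not show the alphabet. Under that structure the number of calls is (length - 2) × a² × (a + 1) for an alphabet of size a, and every call scans all 1000 lines and may leave a full occurrence list attached to a stored Substring.

```java
// Hypothetical helper (not part of the original program): counts how many
// times init_ep_subs would call search_sequences, following the nested
// loops shown in the question.
class SearchCount {

    static long countSearches(int length, int alphabetSize) {
        long positions = Math.max(0, length - 2);            // i runs from 1 to length-2
        long endPairs = (long) alphabetSize * alphabetSize;  // choices of C1 and C3
        long perPair = 1 + alphabetSize;                     // one blank pattern plus one per C2
        return positions * endPairs * perPair;
    }

    public static void main(String[] args) {
        // Alphabet sizes are assumptions: 4 for nucleotides, 20 for amino acids.
        System.out.println(countSearches(9, 4));   // prints 560
        System.out.println(countSearches(9, 20));  // prints 58800
    }
}
```

Multiplying the call count by the typical number of stored {line, index} pairs per pattern gives a rough idea of how much the occurrence lists alone could hold on the heap.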

Also, setting local variables to null won't help, because the JVM already performs data flow and liveness analysis.

Edit: Here's a page that explains how even short regexes can take an exponential amount of time to run.

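Answer 1 (score: 0)

I can't spot an obvious memory leak, but your program does have a number of inefficiencies. Here are some suggestions:

  1. Indent your code properly. It will be easier to read, both for you and for everyone else. In its current form it is very hard to read.
  2. If you are referring to member variables, prefix them with this., otherwise readers of a code snippet cannot know exactly what you are referring to.
  3. Avoid static members and methods unless they are absolutely necessary. When referring to them, use the Classname.membername form, for the same reason.
  4. How is the code of frequency() any different from return occurrences.size()?
  5. In search_sequences(), the regex string sub is constant. You only need to compile it once, but you are recompiling it for every line.
  6. Split the input string (sequences) into lines once and store them in an array or ArrayList. Don't re-split inside search_sequences(); pass in the split collection.
  7. There are probably more things to fix, but this is the list that jumped out.

    Fix all of these, and if you still have problems, you may need to use a profiler to find out what is happening.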
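
Points 5 and 6 could look roughly like the sketch below. It is an illustration under assumptions, not the original program: the class name SequenceSearch and the methods splitLines, searchSequences and frequency are hypothetical names. The pattern is compiled once per search, and the lines are split once up front, so no Scanner is created per search (the stack trace in the question shows the allocation failing inside Scanner.hasNextLine). The frequency method keeps the "once per line" counting described in the question's comment, but uses integer keys instead of building a String for each occurrence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rewrite for illustration; the names are not from the original program.
class SequenceSearch {

    /** Splits the raw input into lines once, so later searches reuse the list. */
    static List<String> splitLines(String sequences) {
        return Arrays.asList(sequences.split("\\r?\\n"));
    }

    /** Returns {lineIndex, startIndex} pairs for every match of sub, overlapping ones included. */
    static List<int[]> searchSequences(String sub, List<String> lines) {
        Pattern pattern = Pattern.compile(sub);  // compiled once per search, not once per line
        List<int[]> occurrences = new ArrayList<>();
        for (int lineIndex = 0; lineIndex < lines.size(); lineIndex++) {
            String line = lines.get(lineIndex);
            Matcher matcher = pattern.matcher(line);
            int from = 0;
            // find(int) resets the matcher and searches from the given index,
            // so restarting at start + 1 also reports overlapping matches.
            while (from <= line.length() && matcher.find(from)) {
                occurrences.add(new int[] {lineIndex, matcher.start()});
                from = matcher.start() + 1;
            }
        }
        return occurrences;
    }

    /** Counts the distinct lines that contain at least one occurrence. */
    static int frequency(List<int[]> occurrences) {
        Set<Integer> linesPresent = new HashSet<>();
        for (int[] occurrence : occurrences) {
            linesPresent.add(occurrence[0]);
        }
        return linesPresent.size();
    }

    public static void main(String[] args) {
        List<String> lines = splitLines("ACGAC\nTTTT\nGACAC");
        List<int[]> hits = searchSequences("A.", lines);
        // prints "4 occurrences on 2 lines"
        System.out.println(hits.size() + " occurrences on " + frequency(hits) + " lines");
    }
}
```

If only the per-line frequency is needed to decide whether a pattern is kept, the occurrence lists of rejected patterns need not be retained at all, which may reduce heap pressure further; whether that fits the rest of the algorithm is not clear from the question.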