Java - Searching text that is 100 million characters long?

Time: 2014-07-29 13:40:44

Tags: java

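I want to search a text document (or several text documents) whose combined length may reach 100 million characters or more.

I'm using Java. Is there a simple way to do this without using much memory? It will run on Android devices, so I want to use as little memory as possible.

I could use String functions such as if(source.contains(phrase)){}. I have measured this: the search itself takes very little time, but it uses a lot of memory.

Here are some results. String searched for: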

"FADE OUT."
312,719 - source length.
62,543,800 - source length multiplied by 200.
1) Phrase found in 6 ms - searched 312,719 characters. Used 261 mb.
2) Phrase found in 1 ms - searched 625,447 characters. Used 269 mb.
3) Phrase found in 0 ms - searched 1,250,903 characters. Used 284 mb.
4) Phrase found in 0 ms - searched 2,501,815 characters. Used 315 mb.
5) Phrase found in 0 ms - searched 5,003,639 characters. Used 33 mb.
6) Phrase found in 0 ms - searched 10,007,287 characters. Used 159 mb.
7) Phrase found in 0 ms - searched 20,014,583 characters. Used 114 mb.
8) Phrase found in 0 ms - searched 40,029,175 characters. Used 229 mb.
9) Phrase found in 0 ms - searched 80,058,359 characters. Used 763 mb.
10) Phrase found in 0 ms - searched 160,116,727 characters. Used 916 mb.

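The source length is the average size of the text files I am searching. I multiplied it by 200 to approximate the total for 200 text files.

So how can I search text files without using up RAM?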

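2 answers:

Answer 0: (score: 2)

This is a very simple algorithm similar to Rabin-Karp (Rabin-Karp is more efficient but, of course, considerably more complex). The find method returns the index of the first occurrence of the supplied phrase.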

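import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Deque;
import java.util.LinkedList;

public class SearchForPhrase {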

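    // Additive hash: the sum of the UTF-16 char values. charAt (not codePointAt)
    // keeps this consistent with the rolling update in find, which adds and
    // removes single chars.
    static int hash(String phrase) {
        int hash = 0;
        for (int i = 0; i < phrase.length(); i++) {
            hash += phrase.charAt(i);
        }
        return hash;
    }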

    static boolean equals(Deque<Character> txt, String phrase) {
        int i = 0;
        for (Character c : txt) {
            if (!c.equals(phrase.charAt(i++))) {
                return false;
            }
        }
        return true;
    }

    static int find(String phrase, Reader in) throws Exception {

        int phash = hash(phrase);
        int hash;

        BufferedReader bin = new BufferedReader(in);
        char[] buffer = new char[phrase.length()];

        // Fill the first window; fewer chars than the phrase means no match is possible.
        int charsRead = bin.read(buffer);

        if (charsRead < phrase.length()) {
            return -1;
        }

        String tmp = new String(buffer);
        hash = hash(tmp);
        if (hash == phash && tmp.equals(phrase)) {
            return 0;
        }

        Deque<Character> queue = new LinkedList<>();
        for (char c : buffer) {
            queue.add(c);
        }

        int curr;
        int index = 1;
        while ((curr = bin.read()) != -1) {

            hash = hash - queue.removeFirst() + curr;
            queue.add((char) curr);

            if (hash == phash && equals(queue, phrase)) {
                return index;
            }

            index++;

        }

        return -1;

    }

    public static void main(String[] args) throws Exception {

        StringWriter writer = new StringWriter();
        PrintWriter out = new PrintWriter(writer);
        out.println("Discuss the person's qualifications for the graduate study in the chosen field. Statements of past");
        out.println("performance, accomplishments, and contributions are helpful. The more relevant the items mentioned, andd");
        out.flush();

        System.out
                .println(find("Discuss", new StringReader(writer.toString())));
        System.out.println(find("the", new StringReader(writer.toString())));
        System.out.println(find("qualifications",
                new StringReader(writer.toString())));
        System.out.println(find("andd", new StringReader(writer.toString())));

    }

}

Output:

0
8
21
199
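
The additive hash in this answer gives the same value for any rearrangement of the same characters, so text with many anagram-like windows triggers extra full comparisons. Below is a minimal sketch of the Rabin-Karp-style polynomial rolling hash the answer alludes to, still reading one character at a time. The class name RollingHashSearch, the multiplier 31, and the ring-buffer window are illustrative assumptions, not part of the original answer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class RollingHashSearch {

    // Assumed multiplier; int overflow wraps consistently and acts as the modulus.
    private static final int BASE = 31;

    /** Returns the char index of the first occurrence of phrase, or -1 if absent. */
    static long find(String phrase, Reader in) throws IOException {
        int m = phrase.length();
        if (m == 0) {
            return 0;
        }
        int target = 0;
        int high = 1; // BASE^(m-1): weight of the oldest char in the window
        for (int i = 0; i < m; i++) {
            target = target * BASE + phrase.charAt(i);
            if (i > 0) {
                high *= BASE;
            }
        }

        char[] window = new char[m]; // ring buffer holding the last m chars
        int hash = 0;
        long count = 0; // chars consumed so far
        int c;
        while ((c = in.read()) != -1) {
            int slot = (int) (count % m);
            if (count >= m) {
                hash -= window[slot] * high; // drop the oldest char's contribution
            }
            hash = hash * BASE + c;
            window[slot] = (char) c;
            count++;
            // Compare char by char only when the hashes agree.
            if (count >= m && hash == target && matches(window, (int) (count % m), phrase)) {
                return count - m;
            }
        }
        return -1;
    }

    // Compares the ring buffer, starting at its oldest slot, with the phrase.
    static boolean matches(char[] window, int start, String phrase) {
        for (int i = 0; i < phrase.length(); i++) {
            if (window[(start + i) % window.length] != phrase.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("INT. ROOM - NIGHT. FADE OUT.");
        System.out.println(find("FADE OUT.", in)); // prints 19
    }
}
```

For a real file, passing a BufferedReader keeps the per-character read() calls cheap; memory stays proportional to the phrase length rather than the file size.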

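Answer 1: (score: 0)

You can use an InputStream (or Reader) to read your text; this is what streams are generally used for. In this case, keep a list of characters the same length as your search string, read characters into it, and discard each character you no longer need. You could do it like this: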

Reader in = ...; // any Reader over your text, e.g. a BufferedReader wrapping a FileReader
String searchStr = "search string";
StringBuilder sb = new StringBuilder(searchStr.length());
// start reading; read() returns an int so that -1 can signal end of stream
int read;
while ((read = in.read()) != -1)
{
    // keep only the last searchStr.length() characters
    if (sb.length() == searchStr.length()) sb.deleteCharAt(0);
    sb.append((char) read);
    if (sb.toString().equals(searchStr))
    {
        System.out.println("Match found!");
        break; // stop reading if you only need one match
    }
}

The only memory allocated for this will be searchStr.length() * 2, so unless your search string is very long, you won't need much memory.
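
One caveat with the loop in this answer is that sb.toString() creates a new String for every character read, which adds garbage-collection pressure on a long input. A possible alternative, sketched here under the assumption that only the first match is needed, is a streaming Knuth-Morris-Pratt search: it keeps no window of the text at all, only an int table the size of the search string. The class and method names (StreamingKmpSearch, failureTable, find) are hypothetical and not from the answer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingKmpSearch {

    // fail[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
    static int[] failureTable(String p) {
        int[] fail = new int[p.length()];
        int k = 0;
        for (int i = 1; i < p.length(); i++) {
            while (k > 0 && p.charAt(i) != p.charAt(k)) {
                k = fail[k - 1];
            }
            if (p.charAt(i) == p.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    /** Returns the char index of the first occurrence of phrase, or -1 if absent. */
    static long find(String phrase, Reader in) throws IOException {
        int m = phrase.length();
        if (m == 0) {
            return 0;
        }
        int[] fail = failureTable(phrase);
        int matched = 0; // number of phrase chars currently matched
        long pos = 0;    // chars consumed so far
        int c;
        while ((c = in.read()) != -1) {
            // On a mismatch, fall back to the longest prefix that can still match.
            while (matched > 0 && c != phrase.charAt(matched)) {
                matched = fail[matched - 1];
            }
            if (c == phrase.charAt(matched)) {
                matched++;
            }
            pos++;
            if (matched == m) {
                return pos - m;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader("some text with a search string inside");
        System.out.println(find("search string", in) >= 0 ? "Match found!" : "No match");
    }
}
```

Each input character is read exactly once and never revisited, so the same method can be called on one file after another without holding any of them in memory.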