Question

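I have a large number of documents (over a million) that I need to scan periodically and match against roughly 100 "multi-word keywords" (i.e. not just single words such as "movie" but also phrases such as "North America"). I have the following code, which works for single-word keywords (e.g. "book"):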

/** 
 * Scan a text for certain keywords
 * @param keywords the list of keywords we are searching for
 * @param text the text we will be scanning
 * @return a list of any keywords from the list which we could find in the text
 */
public static List<String> scanWords(List<String> keywords, String text) {

    // prepare the BreakIterator
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);

    List<String> results = new ArrayList<String>();

    // iterate word by word
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {

        String word = text.substring(start, end);

        if (!StringUtils.isEmpty(word) && keywords.contains(word)){

            // we have this word in our keywords so return it
            results.add(word);
        }
    }

    return results;
}

Note: I need this code to be as efficient as possible, because the number of documents is very large.

My current code cannot find any of the two-word keywords. Any ideas on how to fix it? I am also open to a completely different approach.

Answer 1

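Scanning every document on each run simply will not scale. You are better off indexing your documents in an inverted index, or using Lucene as suggested in the comments.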
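
To make the inverted-index idea concrete, here is a minimal in-memory sketch. It is not how Lucene works internally; the class name PhraseIndex, its methods, and the simple tokenizer (lowercase, split on anything that is not a letter or digit) are assumptions made for illustration only. Each token maps to the documents it occurs in and to its positions there, so a multi-word keyword matches when its words appear at consecutive positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of any library.
class PhraseIndex {
    // token -> (document id -> ascending positions of the token in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Assumed tokenizer: lowercase, split on runs of non-letter/non-digit characters.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one document; done once per document, not once per scan.
    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Ids of documents that contain the words of the phrase at consecutive positions.
    Set<Integer> search(String phrase) {
        List<String> words = tokenize(phrase);
        Set<Integer> hits = new TreeSet<>();
        if (words.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.get(words.get(0));
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesAt(e.getKey(), start, words)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // True if words[1..] occur at positions start+1, start+2, ... in the document.
    private boolean matchesAt(int docId, int start, List<String> words) {
        for (int i = 1; i < words.size(); i++) {
            Map<Integer, List<Integer>> byDoc = postings.get(words.get(i));
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(docId);
            if (positions == null || Collections.binarySearch(positions, start + i) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, each periodic run costs one lookup per keyword (for example search("North America")) instead of a pass over every document. For a million documents you would most likely want a disk-backed index such as Lucene rather than in-memory maps.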

Answer 2

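I think creating an instance of Scanner would help here. The Scanner class has a method that lets you search the text for a pattern, which in your case would be the keyword.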

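// pattern holds the keyword as a regex, e.g. Pattern.quote("North America")
Scanner scanner = new Scanner(text);
while (scanner.hasNextLine()) {
    // findInLine returns the matched text, or null if the rest of the
    // current line contains no match
    String match = scanner.findInLine(pattern);
    if (match == null) {
        scanner.nextLine(); // nothing more on this line, move to the next
    }
}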

The Scanner class is well suited to this kind of task, and I believe it can do what you need.
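
As a variation on the same pattern-search idea, all keywords can be combined into one precompiled regular expression so that each text is scanned once instead of once per keyword. This is only a sketch using java.util.regex rather than Scanner; the class name KeywordMatcher and its methods are hypothetical, and it assumes every keyword starts and ends with a letter or digit so that the \b word boundaries behave as expected.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of any library.
class KeywordMatcher {
    private final Pattern pattern;

    KeywordMatcher(Collection<String> keywords) {
        List<String> sorted = new ArrayList<>();
        for (String k : keywords) {
            if (k != null && !k.trim().isEmpty()) {
                sorted.add(k.trim());
            }
        }
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        // Longest first, so a phrase wins over a shorter keyword that is its prefix.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder alternation = new StringBuilder();
        for (String k : sorted) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote the keyword so regex metacharacters are matched literally.
            alternation.append(Pattern.quote(k));
        }
        // Compiled once and reused for every document.
        pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Distinct keyword occurrences, as spelled in the text, in order of first appearance.
    Set<String> scan(String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Build the matcher once, for example new KeywordMatcher(keywords), and call scan(text) for each document. Note that the words of a multi-word keyword must be separated in the text exactly as they are in the keyword (a single space, for instance); handling arbitrary whitespace would need a slightly more elaborate pattern.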

Scanning a large number of documents for dozens of words
