Searching for "$" in a Lucene index using RegexQuery (not any other query)

Time: 2012-07-03 16:50:26

Tags: java lucene

I have the following program:

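import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.regex.RegexQuery; // from the Lucene contrib regex module
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RegexQueryExample {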

    public static String[] terms = {
        "US $65M dollars",
        "USA",
        "$35",
        "355",
        "US $33",
        "U.S.A",
        "John Keates",
        "Tom Dick Harry",
        "Southeast' Asia"
    };
    private static Directory directory;

    public static void main(String[] args) throws CorruptIndexException, IOException {
        String searchString = ".*\\$.*";
        createIndex();
        searchRegexIndex(searchString);
    }

    /**
     * Builds an in-memory index with one document per sample term.
     */
    private static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);


        for (String term : terms) {
            Document document = new Document();
            if (term.indexOf('$') >= 0) {
                document.add(new Field("type", "currency", Field.Store.YES, Field.Index.NOT_ANALYZED));
            } else {
                document.add(new Field("type", "simple_field", Field.Store.YES, Field.Index.NOT_ANALYZED));
            }
            document.add(new Field("term", term, Field.Store.YES, Field.Index.NOT_ANALYZED));
            indexWriter.addDocument(document);
        }

        indexWriter.close();
    }

    /**
     * Runs a regular-expression query against the "term" field and prints the hits.
     *
     * @param regexString the regular expression to search for.
     */
    private static void searchRegexIndex(String regexString) throws CorruptIndexException, IOException {
        IndexSearcher searcher = new IndexSearcher(directory);

        RegexQuery rquery = new RegexQuery(new Term("term", regexString));
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("type", "simple_field")), BooleanClause.Occur.MUST);
        query.add(rquery, BooleanClause.Occur.MUST);

        TopDocs hits = searcher.search(query, terms.length);
        ScoreDoc[] alldocs = hits.scoreDocs;
        for (int i = 0; i < alldocs.length; i++) {
            Document d = searcher.doc(alldocs[i].doc);
            System.out.println((i + 1) + ". " + d.get("term"));
        }
    }
}

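The createIndex() function builds the Lucene index, and searchRegexIndex() runs the regular-expression query. In main() I search for .*\\$.*, expecting it to return the terms that contain a $ sign. However, it does not work. How can I make it work? Is this a problem with the analyzer?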
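
As a sanity check outside Lucene, the pattern itself can be tried with plain java.util.regex. This is only a sketch: the class name DollarRegexCheck and the helper matching() are made up for illustration, and it assumes whole-string matching via Matcher.matches(). Whether RegexQuery applies the pattern in exactly the same way is an assumption here, not something this snippet proves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DollarRegexCheck {

    // Same sample values as the terms array in the program above.
    static final String[] TERMS = {
        "US $65M dollars", "USA", "$35", "355", "US $33",
        "U.S.A", "John Keates", "Tom Dick Harry", "Southeast' Asia"
    };

    // Hypothetical helper: returns every value that the regex matches in full.
    static List<String> matching(String regex, String[] values) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String value : values) {
            if (pattern.matcher(value).matches()) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The Java literal ".*\\$.*" is the regex .*\$.* (a literal dollar sign anywhere).
        for (String hit : matching(".*\\$.*", TERMS)) {
            System.out.println(hit);
        }
    }
}
```

On the raw strings this selects the three values that contain a dollar sign, so the pattern itself appears to do what I intend when it is applied to the unmodified text.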

Edit

A snapshot of my Lucene index, taken from Luke:

[Image: Lucene Index]

1 answer:

Answer 0: (score: 4)

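You are using StandardAnalyzer, which strips dollar signs from tokens. For example, "US $65M dollars" becomes three tokens: "us", "65m", "dollars". You need to use a different analyzer that does not remove the dollar sign. Luke provides an excellent analysis tool in which you can try out different analyzers and inspect their output.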
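
The effect can be illustrated without Lucene at all. The sketch below is a rough approximation, not Lucene's actual tokenizer: stripPunctuationTokens() stands in for a StandardAnalyzer-style chain (lowercase, split on anything that is not a letter or digit), and whitespaceTokens() stands in for an analyzer that only splits on whitespace. Both helpers and the class name TokenizeSketch are hypothetical, and the comparison only applies to fields that are actually run through an analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class TokenizeSketch {

    // Simplified stand-in for a punctuation-stripping analyzer (assumption,
    // not Lucene's real grammar): lowercase, then split on non-alphanumerics.
    static List<String> stripPunctuationTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Simplified stand-in for a whitespace-only analyzer: '$' is preserved.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if at least one token is matched in full by the regex.
    static boolean anyMatches(String regex, List<String> tokens) {
        Pattern pattern = Pattern.compile(regex);
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "US $65M dollars";
        String regex = ".*\\$.*";
        System.out.println(stripPunctuationTokens(text) + " -> " + anyMatches(regex, stripPunctuationTokens(text)));
        System.out.println(whitespaceTokens(text) + " -> " + anyMatches(regex, whitespaceTokens(text)));
    }
}
```

With the punctuation-stripping stand-in no token contains "$", so the pattern has nothing to match; with the whitespace-only stand-in the token "$65M" survives and matches. In Lucene itself, Luke's analysis tool is the way to confirm what a given analyzer really emits.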