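PyLucene indexer and retriever sample

Time: 2017-12-06 06:20:09

Tags: python python-3.x lucene pylucene

I am new to Lucene. I want to write sample code for PyLucene 6.5 in Python 3, so I adapted the sample code from this version. However, I could not find any documentation, and I am not sure whether my changes are correct.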

# indexer.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    writerConfig = IndexWriterConfig(StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print("%d docs in index" % writer.numDocs())
    print("Reading lines from sys.stdin...")

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    n = 0  # stays 0 if stdin is empty
    for n, l in enumerate(sys.stdin, 1):  # count lines starting at 1
        doc = Document()
        doc.add(Field("text", l, tft))
        writer.addDocument(doc)
    print("Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs()))
    print("Closing index of %d docs..." % writer.numDocs())
    writer.close()

This code reads lines from standard input and stores them in the index directory.

# retriever.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader, DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    analyzer = StandardAnalyzer()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    reader = DirectoryReader.open(indexDir)
    searcher = IndexSearcher(reader)

    query = QueryParser("text", analyzer).parse("hello")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
    for hit in hits.scoreDocs:
        print(hit.score, hit.doc, hit.toString())
        doc = searcher.doc(hit.doc)
        print(doc.get("text"))  # already a str in Python 3; .encode() would print a bytes repr

We should be able to search with retriever.py, but it returns nothing. What is wrong?

2 answers:

Answer 0: (score: 0)

In []: tft.indexOptions()
Out[]: <IndexOptions: NONE>

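Although DOCS_AND_FREQS_AND_POSITIONS is documented as the default, that is no longer the case. It is the default for TextField; on a bare FieldType you have to call setIndexOptions explicitly, otherwise the field is stored but never indexed and no query can match it.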
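
As a minimal sketch of that fix, assuming PyLucene 6.5 is installed and lucene.initVM() has already been called: the helper names clean_lines, make_text_field_type and add_lines are hypothetical and only serve the illustration, while FieldType.setIndexOptions and IndexOptions.DOCS_AND_FREQS_AND_POSITIONS are the Lucene names referred to above.

```python
def clean_lines(stream):
    """Yield each input line without its trailing newline, skipping blank lines."""
    for line in stream:
        text = line.rstrip("\r\n")
        if text.strip():
            yield text


def make_text_field_type():
    """Return a FieldType that is stored, tokenized, and indexed.

    Assumes PyLucene is installed and lucene.initVM() has already been called.
    """
    # Deferred imports: this file can be loaded even where PyLucene is missing.
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    # The missing step: a bare FieldType reports IndexOptions.NONE,
    # so the text is stored but nothing is searchable.
    tft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return tft


def add_lines(writer, lines):
    """Add one document per line to an open IndexWriter; return the number added."""
    from org.apache.lucene.document import Document, Field

    tft = make_text_field_type()
    count = 0  # stays 0 when there are no lines
    for count, text in enumerate(lines, 1):
        doc = Document()
        doc.add(Field("text", text, tft))
        writer.addDocument(doc)
    return count
```

In indexer.py this would be called as add_lines(writer, clean_lines(sys.stdin)) before writer.close(). Documents written with the old field type stay unsearchable, so the index directory probably needs to be rebuilt. If custom field options are not required, TextField("text", text, Field.Store.YES) should give the same result with less code.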

Answer 1: (score: 0)

In my opinion, the best way to get started is to download the PyLucene tarball (for the version of your choice):

https://www.apache.org/dist/lucene/pylucene/

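Inside, you will find a test3/ folder with Python tests (for Python 3; otherwise test2/). These cover common operations such as indexing, reading, searching and so on. Given the lack of documentation for PyLucene, I found them very helpful.

In particular, check out test_Pylucene.py

Note

If the Changelog is not intuitive enough for you, the tests are also a good way to quickly grasp what changed and to adapt your code between releases.

Why I am not providing code in this answer: the problem with code snippets in PyLucene answers on SO is that they quickly become outdated as new versions are released, as can be seen in most of the existing ones.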
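
To see which tests ship with a given release, a small helper along these lines may be useful. It is only a sketch: list_pylucene_tests is a hypothetical name, source_root stands for wherever you unpacked the tarball, and it assumes the test files follow the test_*.py naming seen above.

```python
from pathlib import Path


def list_pylucene_tests(source_root, python_major=3):
    """Return the sorted test_*.py files under test3/ (or test2/) of an unpacked tarball."""
    test_dir = Path(source_root) / f"test{python_major}"
    return sorted(test_dir.glob("test_*.py"))
```

Printing the names returned for two different releases gives a quick overview of which areas gained or lost tests before you start reading individual files.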