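Lucene on Maven: java.lang.IllegalArgumentException, UTF8 encoding is longer than the max length 32766

Time: 2019-04-14 08:45:56

Tags: java apache maven lucene

I am trying to use Lucene (pulled in through Maven) to index large documents whose text exceeds the string length limit. When I do, I get this error: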

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[65, 32, 98, 101, 110, 122, 111, 100, 105, 97, 122, 101, 112, 105, 110, 101, 32, 91, 116, 112, 108, 93, 73, 80, 65, 99, 45, 101, 110, 124]...', original message: bytes can be at most 32766 in length; got 85391

The code is below (it is a copy of http://lucenetutorial.com/lucene-in-5-minutes.html, with a small change so that the document is read from a file):

File file = new File("doc.txt");

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
try (Scanner scanner = new Scanner(file))
{
     while (scanner.hasNextLine())
     {
          String line = scanner.nextLine();
          doc.add(new StringField("content", line, Field.Store.YES));
          w.addDocument(doc);
     }
}

...

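There are other posts about the same problem, but their solutions are for SOLR or Elasticsearch, not for Lucene on Maven, so I am not sure how to fix this.

Could someone point me in the right direction to solve this?

Thanks.

1 answer:

Answer 0 (score: 1):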

If you want to index text rather than a single word, you should use something that can break the text into words, such as qr_code
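
To make the point concrete, here is a minimal sketch in plain Java (no Lucene dependency) of why one long line fails as a single term but not once it is split into words. The 32766 limit is taken from the exception message in the question. The class name TermLengthCheck, its method names, and the whitespace split are assumptions made for illustration only; a real analyzer such as StandardAnalyzer tokenizes more carefully. In Lucene itself, StringField indexes the whole value as one term, while a tokenized field type such as TextField runs the value through the analyzer, so the commented line in the sketch is one likely form of the fix, not necessarily the one this answer had in mind.

```java
import java.nio.charset.StandardCharsets;

final class TermLengthCheck {
    // Limit quoted in the exception message from the question.
    static final int MAX_TERM_BYTES = 32766;

    // UTF-8 size of a value when it is indexed as one single term,
    // which is what happens to the whole line in a StringField.
    static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-8 size of the largest term after a plain whitespace split.
    // This split is only a crude stand-in for a real analyzer.
    static int longestWordBytes(String text) {
        int longest = 0;
        for (String word : text.trim().split("\\s+")) {
            longest = Math.max(longest, utf8Length(word));
        }
        return longest;
    }

    public static void main(String[] args) {
        // Hypothetical input: one very long line made of short words.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 6000; i++) {
            sb.append("benzodiazepine ");
        }
        String line = sb.toString();

        System.out.println("as one term:  " + utf8Length(line) + " bytes");
        System.out.println("longest word: " + longestWordBytes(line) + " bytes");
        System.out.println("one term over limit: " + (utf8Length(line) > MAX_TERM_BYTES));

        // Assumed Lucene equivalent of "breaking the text into words":
        // doc.add(new TextField("content", line, Field.Store.YES));
    }
}
```

A check like utf8Length(line) > MAX_TERM_BYTES can also be used before indexing to find out which lines of doc.txt would trigger the exception when stored as a single term.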