Adding a document to the index in SOLR: Document contains at least one immense term

Time: 2015-04-04 10:27:57

Tags: solr

I am adding documents to a SOLR index from a Java program, but the add(inputDoc) call throws an exception. The log in the Solr web interface contains the following:

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[99, 111, 112, 101, 114, 116, 105, 110, 97, 32, 105, 110, 102, 111, 114, 109, 97, 122, 105, 111, 110, 105, 32, 113, 117, 101, 115, 116, 111, 32]...', original message: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
    ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:663)
    ... 47 more

What should I do to solve this problem?

2 answers:

Answer 0: (score: 15)

I had the same problem and eventually solved it. Check the type of your "text" field; I suspect it is "strings".

You can find it in the core's managed-schema:

<field name="text" type="strings"/>

Alternatively, go to the Solr Admin, visit http://localhost:8983/solr/CORE_NAME/schema/fieldtypes?wt=json and search for "text". If the result looks like the following, your "text" field is defined with the "strings" type:

  {
  "name":"strings",
  "class":"solr.StrField",
  "multiValued":true,
  "sortMissingLast":true,
  "fields":["text"],
  "dynamicFields":["*_ss"]},

In that case my solution applies to you: change "strings" to "text_general" in managed-schema. (Make sure the type of "text" in schema.xml is also "text_general".)

   <field name="text" type="text_general"/>

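This will solve your problem. "strings" is a string field type, whereas "text_general" is a text field type. A string field indexes the whole value as a single term, which is what runs into the 32766-byte limit; a text field is split into individual tokens first.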
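
One way to check whether the whole field value is being indexed as a single term is to decode the byte prefix printed in the exception. The sketch below is plain Java with no Solr dependency; the class and method names are hypothetical helpers, not part of Solr or Lucene, and the limit constant is simply the number quoted in the error message.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of Solr or Lucene.
final class ImmenseTermPrefix {

    // Maximum UTF-8 length of one indexed term, as quoted in the exception message.
    static final int MAX_TERM_BYTES = 32766;

    // Turns a byte list such as "[99, 111, 112]" from the exception text into a string.
    static String decode(String byteList) {
        String inner = byteList.replace("[", "").replace("]", "").trim();
        if (inner.isEmpty()) {
            return "";
        }
        String[] parts = inner.split("\\s*,\\s*");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i]);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if the value would exceed the limit when indexed as one untokenized term.
    static boolean tooLongAsSingleTerm(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        // Byte prefix copied from the exception in the question.
        String prefix = decode("[99, 111, 112, 101, 114, 116, 105, 110, 97, 32, "
                + "105, 110, 102, 111, 114, 109, 97, 122, 105, 111, 110, 105, 32, "
                + "113, 117, 101, 115, 116, 111, 32]");
        System.out.println("'" + prefix + "'"); // prints 'copertina informazioni questo '
    }
}
```

The decoded prefix contains spaces, so the "term" is several words of running text rather than one word. That is consistent with the field not being tokenized at all.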

Answer 1: (score: 5)

You have probably run into what is described in LUCENE-5472 [1]: Lucene throws an error if a term is too long. You can:

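  • use a LengthFilterFactory [2] (in the index analyzer) to filter out tokens that fall outside the required length range

  • use a TruncateTokenFilterFactory [3] (in the index analyzer) to cap the maximum length of indexed tokens

  • use a custom UpdateRequestProcessor [4], but that really depends on your context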
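
To make the difference between the first two options concrete, here is a plain-Java illustration of what each does to a stream of tokens. It does not use the Solr filter classes; the class and method names are hypothetical, and lengths are counted in Java chars for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; these are not the Solr filter classes.
final class TokenLengthSketch {

    // Length-filter behaviour: drop tokens whose length is outside [min, max].
    static List<String> keepWithinLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= min && token.length() <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Truncate behaviour: keep every token, cut down to at most prefixLength chars.
    static List<String> truncate(List<String> tokens, int prefixLength) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.length() > prefixLength ? token.substring(0, prefixLength) : token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("copertina", "informazioni", "questo");
        System.out.println(keepWithinLength(tokens, 1, 9)); // [copertina, questo]
        System.out.println(truncate(tokens, 9));            // [copertina, informazi, questo]
    }
}
```

Filtering loses the oversized token entirely, while truncation keeps a searchable prefix of it. Both filters run inside an analyzer chain, so they only take effect on an analyzed text field type, not on a solr.StrField.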

[1] https://issues.apache.org/jira/browse/LUCENE-5472
[2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
[3] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TruncateTokenFilterFactory
[4] https://wiki.apache.org/solr/UpdateRequestProcessor
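
For the third option, the core logic a custom processor (or the Java client, just before calling add(inputDoc)) might apply is to shorten a value so that its UTF-8 encoding fits the limit. This is only a sketch under that assumption: the class and method names are hypothetical, and whether silently shortening the value is acceptable depends on your data.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; where it is called from is up to the application.
final class Utf8Truncate {

    // Returns the longest prefix of value whose UTF-8 encoding is at most maxBytes,
    // without cutting a surrogate pair in half.
    static String truncateToUtf8Bytes(String value, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < value.length()) {
            int cp = value.codePointAt(i);
            // Number of UTF-8 bytes needed for this code point.
            int cpBytes = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + cpBytes > maxBytes) {
                break;
            }
            bytes += cpBytes;
            i += Character.charCount(cp);
        }
        return value.substring(0, i);
    }

    public static void main(String[] args) {
        // 32766 is the limit quoted in the exception message.
        String shortened = truncateToUtf8Bytes("copertina informazioni questo", 32766);
        System.out.println(shortened.getBytes(StandardCharsets.UTF_8).length);
    }
}
```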