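How to handle "Document contains at least one immense term" in SOLR?

Date: 2016-05-06 10:54:28

Tags: solr lucene

Since LUCENE-5472, Lucene throws an error when a term is too long, instead of only logging a message. The error states that SOLR doesn't accept tokens larger than 32766 bytes: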

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111, 110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101, 32, 116]...', original message: bytes can be at most 32766 in length; got 43225
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:671)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
    ... 54 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 43225
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
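
The byte prefix printed in the message can be decoded to see what the oversized term starts with. A minimal Python sketch (the variable names are illustrative and not part of Solr or Lucene):

```python
# Byte prefix copied from the "immense term" error message above.
prefix = [10, 10, 70, 111, 117, 110, 100, 32, 116, 104, 105, 115, 32, 111,
          110, 32, 116, 104, 101, 32, 119, 101, 98, 32, 104, 111, 112, 101,
          32, 116]

# The values are UTF-8 bytes, so turn them into a bytes object and decode.
decoded = bytes(prefix).decode("utf-8")

# repr() makes the leading newlines visible.
print(repr(decoded))  # '\n\nFound this on the web hope t'
```

The prefix reads like the start of running text with spaces in it rather than a single word, which suggests the field value reached the indexer as one untokenized term.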

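To work around this, I added two filters to the schema (TruncateTokenFilterFactory and LengthFilterFactory, the last two filters in the index analyzer):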

<field name="text" type="text_en_splitting" termPositions="true" termOffsets="true" termVectors="true" indexed="true" required="false" stored="true"/>
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="lang/stopwords_en.txt"
            />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <!-- the two filters below were added to address the error -->
        <filter class="solr.TruncateTokenFilterFactory" prefixLength="32700"/>
        <filter class="solr.LengthFilterFactory" min="2" max="32700"/>
      </analyzer>
</fieldType>

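The error stayed the same, which made me think the filters were not set up correctly. Is that possible? Update: restarting the server was the key, thanks to Mr. Bashetti.

The question is which one is better: LengthFilterFactory or TruncateTokenFilterFactory? And is it correct to assume that one byte is one character (since the filter should remove 'uncommon' characters)? Thanks!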
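
On the byte-versus-character part of the question: the limit in the error message is expressed in UTF-8 bytes, and one character is one byte only for ASCII text. A minimal Python sketch of the difference follows. It assumes the two filters count characters rather than bytes, which is worth verifying for the Solr version in use. The names `MAX_TERM_BYTES`, `utf8_len`, and `SAFE_CHARS` are illustrative.

```python
# Limit quoted in the error message (UTF-8 encoded length of one term).
MAX_TERM_BYTES = 32766


def utf8_len(token: str) -> int:
    """Return the number of bytes the token occupies when encoded as UTF-8."""
    return len(token.encode("utf-8"))


# 32700 ASCII characters stay under the limit: one byte each.
ascii_token = "a" * 32700
# 32700 accented characters do not: "é" takes two bytes in UTF-8.
accented_token = "é" * 32700

print(utf8_len(ascii_token))     # 32700
print(utf8_len(accented_token))  # 65400

# A code point needs at most 4 bytes in UTF-8, so this character count
# can never exceed the byte limit, whatever the text contains.
SAFE_CHARS = MAX_TERM_BYTES // 4
print(SAFE_CHARS)                # 8191
```

Under that assumption, a character-based limit of 32700 only protects ASCII-only tokens. A lower value such as the worst-case figure above would be safe for any text.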

2 Answers:

Answer 0: (score: 2)

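The error says "SOLR doesn't accept token larger than 32766".

The problem occurs because you previously used the String fieldType for the field text. The same error kept appearing after you changed the field type because you did not restart the Solr server after the change.

I don't think there is any need to add TruncateTokenFilterFactory or LengthFilterFactory.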

But that is still up to you and your requirements.

Answer 1: (score: 0)

I hit the same error when loading the enron_email data (from http://jsonstudio.com/wp-content/uploads/2014/02/enron.zip) into solr-6.0.0 with the bin/post tool.

See the excerpt below:

  

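... contains at least one immense term in field=\"text\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[78, 97, 110, 99, 121, 32, 83, 104, 101, 101, 100, 32, 60, 110, 97, 110, 99, 121, 95, 115, 104, 101, 101, 100, 64, 66, 85, 83, 73, 78]...', original message: bytes can be at most 32766 in length; got 43172. Perhaps the document has an indexed string field (solr.StrField) which is too large","code":400}} SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/enron_emails/update/json/docs ...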

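RCA: The schema field named "text" uses the fieldtype strings, which is limited to 32766 bytes per value. To accept data/clobs longer than 32766, the field type must be changed to text_general.

Solution

A) In standalone mode

Edit the file $SOLR_HOME/server/solr/core_name/conf/managed-schema. I changed

<field name="text" type="strings"/>

to

<field name="text" type="text_general"/>

B) In SolrCloud mode (since the managed-schema file lives in the embedded or external ZooKeeper)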

# Check the collection's current definition of the field "text"
curl "http://localhost:8983/solr/enron_emails/schema/fields/text?wt=json&indent=true"

# Modify the collection's definition of the field "text"
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
     "name":"text",
     "type":"text_general",
     "stored":false } }' "http://localhost:8983/solr/enron_emails/schema?wt=json&indent=true"

# Verify the collection's new definition of the field "text"
curl "http://localhost:8983/solr/enron_emails/schema/fields/text?wt=json&indent=true"
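
For scripted setups, the same replace-field call can be prepared from Python with only the standard library. This is a sketch under assumptions: the URL and collection name are taken from the curl commands above, and the helper name `build_replace_field_request` is hypothetical. The request is only built here. Sending it requires a running Solr instance.

```python
import json
import urllib.request

# Base URL of the collection, as used in the curl commands above.
SOLR_URL = "http://localhost:8983/solr/enron_emails"


def build_replace_field_request(name, field_type, stored=False):
    """Build (but do not send) a Schema API request that replaces a field."""
    payload = {
        "replace-field": {"name": name, "type": field_type, "stored": stored}
    }
    return urllib.request.Request(
        f"{SOLR_URL}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_replace_field_request("text", "text_general")

# To actually apply the change against a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```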