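I am new to Solr and just want to try indexing a few PDF files. Starting from an empty field list in schema.xml, I keep getting this error message:
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'
(#docid and #fieldname are placeholders for the actual values)
Is there a way to find out all the fields in my PDF files? Adding them one after another is not much fun :)
What is the best way to filter these out before loading into Solr? schema.xml seems to be the last option. Is there a configuration file where I can get rid of the junk fields sooner, possibly improving performance?
My environment: Cloudera Quickstart VM with CDH 5
Thanks in advance for your help.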
Answer 0 (score: 1)
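You need to look at the ExtractingRequestHandler (also known as SolrCell) and its configuration. Its documentation includes an example of how to use uprefix to ignore all fields that are unknown to the schema:
Example:
uprefix=ignored_
This effectively ignores all unknown fields generated by Tika, given that the example schema contains <dynamicField name="ignored_*" type="ignored"/>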
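As a rough sketch of where this parameter can live, uprefix can be set as a default on the extract handler in solrconfig.xml instead of being passed on every request. The fragment below is modeled on the Solr example configuration. The handler path /update/extract and the lowernames setting are assumptions about your setup, so compare them with the solrconfig.xml that ships with your CDH 5 installation before copying anything.

```xml
<!-- solrconfig.xml (fragment): assumed handler path, adjust to your installation -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Normalize Tika metadata names to lowercase with underscores -->
    <str name="lowernames">true</str>
    <!-- Prefix every field that is not defined in schema.xml with "ignored_",
         so that it matches the ignored_* dynamicField and is dropped -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

This only works if schema.xml also defines the "ignored" field type that the dynamicField refers to. In the example schema it is, as far as I recall, a solr.StrField with indexed="false" and stored="false", which is why the unknown fields are accepted but never kept.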
The example schema also defines a list of fields covering all the values expected from SolrCell, along with their types:
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them. Some metadata is parsed from the documents,
but there are some which come from the client context:
"content_type": From the HTTP headers of incoming stream
"resourcename": From SolrCell request param resource.name
-->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>