How to get date strings from the content of a PDF with Apache Solr

Asked: 2012-11-23 09:25:44

Tags: apache solr solr-cell

Hi everyone, I'm new to Apache Solr. I have a PDF that contains date information, such as bla bla bla 2012-11-23 11:11:12 bla bla ..., and I want to get all the dates from the content.
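To make the goal concrete, here is a minimal sketch of the result I am after, written as plain Python outside of Solr. It is only an illustration: the names `TIMESTAMP_RE` and `find_timestamps` are made up for this example, and it assumes the dates always look like the one in the sample text above.

```python
import re
from datetime import datetime

# Matches text shaped like 2012-11-23 11:11:12 (yyyy-MM-dd HH:mm:ss).
# Assumption: this is the only date layout that appears in the PDF text.
TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")


def find_timestamps(text):
    """Return every valid timestamp string in text, in order of appearance."""
    found = []
    for candidate in TIMESTAMP_RE.findall(text):
        try:
            # Reject strings with the right shape but impossible values,
            # such as month 13 or day 40.
            datetime.strptime(candidate, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
        found.append(candidate)
    return found


sample = "bla bla bla 2012-11-23 11:11:12 bla bla"
print(find_timestamps(sample))
```

Ideally Solr would give me a list like this for each indexed PDF, so that I do not have to post-process the extracted text myself.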

I read some of the documentation (http://wiki.apache.org/solr/ExtractingRequestHandler) and added date.formats to /update/extract:

 <requestHandler name="/update/extract" 
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
  <!-- All the main content goes into "text"... if you need to return
       the extracted text or do highlighting, use a stored field. -->
  <str name="fmap.content">text</str>
  <str name="lowernames">true</str>
  <str name="uprefix">ignored_</str>

  <!-- capture link hrefs but ignore div attributes -->
  <str name="captureAttr">true</str>
  <str name="fmap.a">links</str>
  <str name="fmap.div">ignored_</str>
</lst>
<lst name="date.formats">
  <str>yyyy-MM-dd</str>
  <str>yyyy-MM-dd'T'HH:mm:ss'Z'</str>
  <str>yyyy-MM-dd'T'HH:mm:ss</str>
  <str>yyyy-MM-dd</str>
  <str>yyyy-MM-dd hh:mm:ss</str>
  <str>yyyy-MM-dd HH:mm:ss</str>
</lst>
</requestHandler>

I am adding the PDF like this:

curl "http://localhost:8983/solr/update/extract?literal.id=sql.txt&uprefix=attr_&fmap.content=attr_content&commit=true"&stream.file="/home/example/example.pdf"

But the result contains nothing about the dates, and nothing about the content either?

Thanks

0 answers:

No answers