Solr: get the total count of a specific keyword or phrase in a PDF

Time: 2016-04-30 07:43:15

Tags: pdf search solr search-engine keyword

I am new to Solr, and I am trying to get the total number of times a specific keyword or phrase occurs in a PDF document.

I did a simple installation by downloading solr-6.0.0.zip, unpacking it, and running solr-6.0.0/bin/solr start -e cloud -noprompt.

I then downloaded a test PDF file from http://www.orimi.com/pdf-test.pdf and posted it to Solr for indexing with the following command:

solr-6.0.0/bin/post -c gettingstarted ../pdf-test.pdf

The output was as follows:

java -classpath /home/test/solr-6.0.0/dist/solr-core-6.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool ../pdf-test.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file pdf-test.pdf (application/pdf) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:04.450

When I run a simple query over all documents, I can see that my pdf-test.pdf has been indexed:

http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":26,
    "params":{
      "q":"*:*",
      "indent":"on",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"/home/test/solr-6.0.0/../pdf-test.pdf",
        "date":["2008-06-04T22:47:36Z"],
        "pdf_pdfversion":[1.6],
        "xmp_creatortool":["Acrobat PDFMaker 7.0.7 for Word"],
        "company":["Government of Yukon"],
        "stream_content_type":["application/pdf"],
        "dc_creator":["Yukon",
          "Canada",
          "Yukon Department of Education"],
        "dcterms_created":["2008-06-04T22:44:00Z"],
        "last_modified":["2008-06-04T22:47:36Z"],
        "dcterms_modified":["2008-06-04T22:47:36Z"],
        "dc_format":["application/pdf; version=1.6"],
        "title":[" PDF Test Page"],
        "last_save_date":["2008-06-04T22:47:36Z"],
        "meta_save_date":["2008-06-04T22:47:36Z"],
        "pdf_encrypted":[true],
        "dc_title":[" PDF Test Page"],
        "modified":["2008-06-04T22:47:36Z"],
        "content_type":["application/pdf"],
        "stream_size":[20597],
        "x_parsed_by":["org.apache.tika.parser.DefaultParser",
          "org.apache.tika.parser.pdf.PDFParser"],
        "creator":["Yukon",
          "Canada",
          "Yukon Department of Education"],
        "meta_author":["Yukon",
          "Canada",
          "Yukon Department of Education"],
        "meta_creation_date":["2008-06-04T22:44:00Z"],
        "created":["Wed Jun 04 22:44:00 UTC 2008"],
        "xmptpg_npages":[1],
        "creation_date":["2008-06-04T22:44:00Z"],
        "resourcename":["/home/test/solr-6.0.0/../pdf-test.pdf"],
        "sourcemodified":["D:20080604224256"],
        "author":["Yukon",
          "Canada",
          "Yukon Department of Education"],
        "producer":["Acrobat Distiller 7.0.5 (Windows)"],
        "_version_":1533019696879632384}]
  }}

Now I would like to get, for example, the total number of occurrences of the keyword PDF.

I know that the keyword PDF occurs 4 times in the file pdf-test.pdf.
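An expected count like this can be checked outside Solr once the text of the PDF has been extracted. The following is only a minimal sketch: the helper name `count_occurrences` and the sample string are made up for illustration, and the sample is not the actual content of pdf-test.pdf. It counts whole-word, case-insensitive matches, which may differ from how a Solr analyzer tokenizes the text.

```python
import re


def count_occurrences(text: str, phrase: str) -> int:
    """Count case-insensitive, whole-word occurrences of a keyword or phrase.

    Hypothetical helper for sanity-checking a count; it is not part of Solr.
    """
    words = phrase.split()
    if not words:
        return 0
    # Allow any run of whitespace between the words of a phrase,
    # and require word boundaries at both ends of the match.
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up sample text (NOT the real content of pdf-test.pdf).
sample = "PDF Test Page. This pdf is a test; open the PDF in a PDF viewer."
print(count_occurrences(sample, "pdf"))        # prints 4
print(count_occurrences(sample, "test page"))  # prints 1
```

The same function works for a multi-word phrase, since the words of the phrase are joined with a whitespace pattern before matching.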

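How can I query for this result in this simple setup? What should the query look like? Or, more generally, how can I query an indexed PDF file for a list of occurrences of a keyword or phrase?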
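One direction that might be worth trying is Solr's `termfreq(field,term)` function query, which returns how many times a single term occurs in a field of each matching document. The sketch below only builds such a request URL; it does not contact Solr. It rests on assumptions that the output above does not confirm: that the extracted body text is indexed in a field named `_text_` (the response shows no stored content field, so the real field name may differ), and that the term is given in its indexed form (for example lowercased `pdf`). The helper name `build_termfreq_url` and the alias `count` are illustrative. `termfreq` works on single terms, so it would not cover multi-word phrases.

```python
from urllib.parse import urlencode


def build_termfreq_url(base_url: str, collection: str, field: str, term: str) -> str:
    """Build a /select URL that asks Solr for a per-document term frequency.

    Hypothetical helper; `field` must be an indexed field that actually
    holds the extracted PDF text (assumed here, not confirmed by the source).
    The term is inserted as-is, so it must not contain a single quote.
    """
    params = {
        "q": "*:*",
        # Return the document id plus the term frequency under the alias "count".
        "fl": f"id,count:termfreq({field},'{term}')",
        "wt": "json",
        "indent": "on",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_termfreq_url(
    "http://localhost:8983/solr",  # base URL from the post output above
    "gettingstarted",              # collection used in the question
    "_text_",                      # ASSUMED name of the indexed content field
    "pdf",                         # term in its assumed indexed (lowercase) form
)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would show whether the assumed field name applies to this setup.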

0 answers:

No answers