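I'm new to Solr, and I'm trying to get the total number of occurrences of a specific keyword or phrase in a PDF document.
I did a simple install by downloading solr-6.0.0.zip, unpacking it, and running solr-6.0.0/bin/solr start -e cloud -noprompt.
Then I downloaded a test PDF file from http://www.orimi.com/pdf-test.pdf and posted it to Solr for indexing with the following command: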
solr-6.0.0/bin/post -c gettingstarted ../pdf-test.pdf
The output was:
java -classpath /home/test/solr-6.0.0/dist/solr-core-6.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool ../pdf-test.pdf
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file pdf-test.pdf (application/pdf) to [base]/extract
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:04.450
When I run a simple query for all documents, I can see that my pdf-test.pdf has been indexed:
http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":26,
"params":{
"q":"*:*",
"indent":"on",
"wt":"json"}},
"response":{"numFound":1,"start":0,"maxScore":1.0,"docs":[
{
"id":"/home/test/solr-6.0.0/../pdf-test.pdf",
"date":["2008-06-04T22:47:36Z"],
"pdf_pdfversion":[1.6],
"xmp_creatortool":["Acrobat PDFMaker 7.0.7 for Word"],
"company":["Government of Yukon"],
"stream_content_type":["application/pdf"],
"dc_creator":["Yukon",
"Canada",
"Yukon Department of Education"],
"dcterms_created":["2008-06-04T22:44:00Z"],
"last_modified":["2008-06-04T22:47:36Z"],
"dcterms_modified":["2008-06-04T22:47:36Z"],
"dc_format":["application/pdf; version=1.6"],
"title":[" PDF Test Page"],
"last_save_date":["2008-06-04T22:47:36Z"],
"meta_save_date":["2008-06-04T22:47:36Z"],
"pdf_encrypted":[true],
"dc_title":[" PDF Test Page"],
"modified":["2008-06-04T22:47:36Z"],
"content_type":["application/pdf"],
"stream_size":[20597],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"],
"creator":["Yukon",
"Canada",
"Yukon Department of Education"],
"meta_author":["Yukon",
"Canada",
"Yukon Department of Education"],
"meta_creation_date":["2008-06-04T22:44:00Z"],
"created":["Wed Jun 04 22:44:00 UTC 2008"],
"xmptpg_npages":[1],
"creation_date":["2008-06-04T22:44:00Z"],
"resourcename":["/home/test/solr-6.0.0/../pdf-test.pdf"],
"sourcemodified":["D:20080604224256"],
"author":["Yukon",
"Canada",
"Yukon Department of Education"],
"producer":["Acrobat Distiller 7.0.5 (Windows)"],
"_version_":1533019696879632384}]
}}
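The same match-all query can also be issued from a script, which makes it easier to inspect individual fields of the response. The following is a minimal sketch, assuming the Solr instance from the setup above is reachable at localhost:8983; the helper names `build_select_url`, `num_found`, and `fetch` are made up for illustration and are not part of Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host and port taken from the post output above; adjust if Solr runs elsewhere.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, params):
    """Build a /select URL for a collection (hypothetical helper)."""
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"


def num_found(response_text):
    """Read response.numFound from a Solr JSON response body."""
    return json.loads(response_text)["response"]["numFound"]


def fetch(url):
    """Fetch a URL and return the body as text (requires a running Solr)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    url = build_select_url(
        "gettingstarted", {"q": "*:*", "indent": "on", "wt": "json"}
    )
    print(num_found(fetch(url)))
```

With only pdf-test.pdf indexed, the printed value should correspond to the numFound field in the response shown above.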
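Now I would like to get, for example, the total number of occurrences of the keyword PDF.
I know that the keyword PDF occurs 4 times in pdf-test.pdf.
How do I query in this simple setup to get that result? What should the query look like? Or, more generally, how do I query to get a list of occurrences of a keyword or phrase from an indexed PDF file?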
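One possible starting point, offered as an untested sketch rather than a confirmed answer, is Solr's `termfreq(field,term)` function query, which returns how many times a term appears in a field for each document. The field name `_text_` below is an assumption about where this configuration indexes the extracted PDF body, and the helper `build_termfreq_url` is hypothetical. The term probably has to match its indexed form (likely lowercase after analysis), and `termfreq` counts single terms, not phrases.

```python
from urllib.parse import urlencode

# Host and port taken from the post output above.
SOLR_BASE = "http://localhost:8983/solr"


def build_termfreq_url(collection, field, term):
    """Build a /select URL that asks for a per-document term count.

    Hypothetical helper. `field` must be an indexed field; which field
    holds the extracted PDF text depends on the configuration.
    The term is inserted as-is, so it must not contain a single quote.
    """
    params = {
        "q": "*:*",
        # "count:" is an alias for the function result in the field list.
        "fl": f"id,count:termfreq({field},'{term}')",
        "wt": "json",
    }
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumption: extracted content is indexed in the _text_ field.
    print(build_termfreq_url("gettingstarted", "_text_", "pdf"))
```

If `_text_` also receives copies of the metadata fields, the count may come out higher than the number of occurrences in the page text itself.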