Question

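I am using Nutch 1.13 and Solr 5.5. In most cases the URL field equals the ID field when documents are indexed in Solr, but I have seen cases where the ID differs from the URL field. This happens when URL1 redirects to URL2 and URL2 is fetched. There are two cases.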

First case (ID is not equal to URL): https://www.givaudan.com/files/giv-2018-integrated-annual-report.pdf (the repr metadata) is used as the URL, and https://www.givaudan.com/file/149296/download is used as the ID in Solr.

https://www.givaudan.com/files/giv-2018-integrated-annual-report.pdf     
Version: 7
Status: 4 (db_redir_temp)
Fetch time: Thu Mar 07 07:18:53 UTC 2019
Modified time: Tue Feb 05 07:18:53 UTC 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0013103343
Signature: 989b82c1e6e738b74f36d64534f95050
Metadata: 
_pst_=temp_moved(13), lastModified=0: 
https://www.givaudan.com/file/149296/download
_rs_=2508
Content-Type=text/html
nutch.protocol.code=302




https://www.givaudan.com/file/149296/download   Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Mar 07 07:19:08 UTC 2019
Modified time: Tue Feb 05 07:19:08 UTC 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0029494818
Signature: 7ecff30181eb4268cfb1dd0b79df7e8a
Metadata: 
_repr_=https://www.givaudan.com/files/giv-2018-integrated-annual-report.pdf
_pst_=success(1), lastModified=1549351146000
_rs_=14411
Content-Type=application/pdf
nutch.protocol.code=200

Second case (ID is the same as URL): there is no repr metadata.

https://www.givaudan.com/files/giv-2017-annual-report.pdf   Version: 7
Status: 4 (db_redir_temp)
Fetch time: Thu Mar 07 07:18:14 UTC 2019
Modified time: Tue Feb 05 07:18:14 UTC 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0012841906
Signature: e47ac79e3f75007a0c89490e7e2bbdbd
Metadata: 
_pst_=temp_moved(13), lastModified=0: 
https://www.givaudan.com/file/86431/download
_rs_=2537
Content-Type=text/html
nutch.protocol.code=302


https://www.givaudan.com/file/86431/download    Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Mar 07 07:19:46 UTC 2019
Modified time: Tue Feb 05 07:19:46 UTC 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 5.633987E-4
Signature: 03a2542baa11916676f438c662e58b2e
Metadata: 
_pst_=success(1), lastModified=1549350016000
_rs_=5620
Content-Type=application/pdf
nutch.protocol.code=200

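What does the repr metadata tag indicate? Are only URLs with db_fetched status indexed? Currently I look up documents in Solr by URL, which works for most URLs, but there are a few cases like the ones above, and https://www.givaudan.com/file/149296/download does not return any results. Should I use the ID instead of the URL to fetch data from Solr? Would that cause any problems?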

Answer 1

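By design, Nutch can only index documents under URLs that were fetched successfully (HTTP status 200). In your examples, those are the URLs ending in .../download. For redirects, there are some heuristics to find the most representative URL: in the first example, https://www.givaudan.com/files/giv-2018-integrated-annual-report.pdf is chosen as _repr_ over the https://www.givaudan.com/file/149296/download URL (not a bad choice, in my opinion). However, the heuristics do not work if the redirect target URL was found earlier as a normal link or was injected as a seed.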
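To make the idea of a "most representative URL" more concrete, here is a rough Python sketch of that kind of heuristic. It is loosely modeled on the rules I understand Nutch's `URLUtil.chooseRepr` to follow, but the function name `choose_repr`, the exact rule order, and the use of a plain host comparison instead of a real domain comparison are my own simplifying assumptions, not the actual Nutch code.

```python
from urllib.parse import urlsplit


def choose_repr(src: str, dst: str, temp: bool) -> str:
    """Pick the more representative URL of a redirect pair (simplified sketch).

    src  -- the URL that answered with a redirect
    dst  -- the redirect target
    temp -- True for a temporary redirect (e.g. 302), False for a permanent one
    """
    s, d = urlsplit(src), urlsplit(dst)

    # Assumption: a host comparison stands in for a proper domain comparison.
    # A redirect to a different host keeps the destination.
    if (s.hostname or "") != (d.hostname or ""):
        return dst

    src_root = s.path in ("", "/")
    dst_root = d.path in ("", "/")

    if not temp:
        # Permanent redirect: keep the source only if it is the site root.
        return src if src_root else dst

    # Temporary redirect: a root URL wins over a non-root URL.
    if src_root and not dst_root:
        return src
    if dst_root and not src_root:
        return dst

    # Otherwise prefer fewer path segments, then the shorter path.
    src_segments = len(s.path.split("/"))
    dst_segments = len(d.path.split("/"))
    if src_segments != dst_segments:
        return dst if dst_segments < src_segments else src
    return dst if len(d.path) < len(s.path) else src
```

Under these assumed rules, the first example from the question (a 302 from /files/giv-2018-integrated-annual-report.pdf to /file/149296/download on the same host) keeps the .pdf URL, because its path has fewer segments. That matches the `_repr_` value in the dump.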

Should I use the ID instead of the URL to fetch data from Solr?

Yes, because it does not change. But you can use the "URL" field to display as the link on the search results page.
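As an illustration, a lookup by ID could be built roughly like this. This is only a sketch: the helper name `solr_id_query_url`, the core URL `http://localhost:8983/solr/nutch`, and the lowercase field names `id` and `url` are assumptions on my part, so adjust them to your own setup and schema.

```python
from urllib.parse import urlencode


def solr_id_query_url(solr_core_url: str, doc_id: str) -> str:
    """Build a Solr /select URL that looks a document up by its ID.

    solr_core_url -- base URL of the core (hypothetical in the example below)
    doc_id        -- the ID value, i.e. the URL that was actually fetched
    """
    # Inside a quoted phrase only backslashes and double quotes need escaping.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'id:"{escaped}"',  # exact match on the ID field
        "fl": "id,url",          # return the ID and the display URL
        "wt": "json",
    }
    return f"{solr_core_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical core URL; the ID is the redirect target from the first case.
print(solr_id_query_url(
    "http://localhost:8983/solr/nutch",
    "https://www.givaudan.com/file/149296/download",
))
```

The returned `url` field can then be shown as the link in the results, while `id` stays the key you query on.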

ID field in Solr differs from the URL when fetching a redirected URL
