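I'm trying to download WARC files from the AWS Common Crawl over HTTPS. This used to work, but for some reason, when I tried recently, I keep getting a "The specified key does not exist" error.
When I query the index for a particular URL I do get a response, but when I try to download the WARC for each record I get the error.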
Here is the index URL to test: http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=fivethirtyeight.com&matchType=domain&output=json
It returns many records in the following format:
{"urlkey": "com,fivethirtyeight)/", "timestamp": "20150228172316", "url": "http://fivethirtyeight.com/", "length": "17426", "filename": "crawl-data/CC-MAIN-2015-11/segments/1424936462009.45/warc/CC-MAIN-20150226074102-00094-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "FXI6SYLZSAFRSUOIKOZ6XVMQW2NHHLZK", "offset": "96230370"}
This is the URL I'm trying to use to download the WARC for that record: https://aws-publicdatasets.s3.amazonaws.com/crawl-data/CC-MAIN-2015-11/segments/1424936462009.45/warc/CC-MAIN-20150226074102-00094-ip-10-28-5-156.ec2.internal.warc.gz
Am I missing something really obvious?

Answer 0 (score: 1)
Based on this ...
Published at: s3://aws-publicdatasets/common-crawl/
... it looks like you are missing the /common-crawl path prefix, so I'd suggest the correct URL is https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-11/...
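
As a rough sketch of the fix, the snippet below builds the download URL from an index record by putting the common-crawl prefix in front of the record's filename. The helper names (warc_url, range_header) are made up for illustration. The Range header is a further assumption: it treats "offset" and "length" as the byte position and compressed size of the record inside the WARC file, which the question does not state.

```python
import json

# Host and prefix as given in the answer; adjust if the dataset location differs.
BASE_URL = "https://aws-publicdatasets.s3.amazonaws.com"
PREFIX = "common-crawl"


def warc_url(record, base_url=BASE_URL, prefix=PREFIX):
    """Hypothetical helper: prepend the bucket prefix to the record's filename."""
    return f"{base_url}/{prefix}/{record['filename']}"


def range_header(record):
    """Hypothetical helper: HTTP Range header covering a single record.

    Assumes 'offset' is the starting byte and 'length' the number of bytes
    of the record inside the .warc.gz file.
    """
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# One line of the index response, copied from the question.
line = (
    '{"urlkey": "com,fivethirtyeight)/", "timestamp": "20150228172316", '
    '"url": "http://fivethirtyeight.com/", "length": "17426", '
    '"filename": "crawl-data/CC-MAIN-2015-11/segments/1424936462009.45/warc/'
    'CC-MAIN-20150226074102-00094-ip-10-28-5-156.ec2.internal.warc.gz", '
    '"digest": "FXI6SYLZSAFRSUOIKOZ6XVMQW2NHHLZK", "offset": "96230370"}'
)
record = json.loads(line)

print(warc_url(record))
print(range_header(record))
```

Passing the returned header to an HTTP client along with the URL should fetch only that one record rather than the whole file, provided the assumption about offset and length holds.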