python - Download the complete Common Crawl index files

The Common Crawl index files are publicly available at s3://commoncrawl/cc-index/collections/

You can list all available crawl indexes with the AWS command line: aws s3 ls s3://commoncrawl/cc-index/collections/

The index files for April 2015 are located at s3://commoncrawl/cc-index/collections/CC-MAIN-2015-18/indexes/

To download an index file over HTTP, you can use a URL like the following:

https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2015-18/indexes/cdx-00000.gz

The cdx files mostly run from cdx-00000.gz to cdx-00299.gz, so the complete index is contained in 300 files.
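As a rough sketch of how all 300 files could be fetched over HTTP, the snippet below builds the URLs from the naming pattern described above and downloads them one by one using only the standard library. The function names (index_urls, download_all) and the cc-index output directory are illustrative assumptions, and the count of 300 is taken from the question rather than verified against the bucket.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL taken from the question; whether it is still served is an assumption.
BASE_URL = (
    "https://commoncrawl.s3.amazonaws.com/"
    "cc-index/collections/CC-MAIN-2015-18/indexes/"
)


def index_urls(count=300):
    """Build the URLs cdx-00000.gz .. cdx-(count-1).gz (hypothetical helper)."""
    return [f"{BASE_URL}cdx-{i:05d}.gz" for i in range(count)]


def download_all(dest_dir="cc-index", count=300):
    """Download every index file into dest_dir, skipping files already present."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for url in index_urls(count):
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # already downloaded in an earlier run
        partial = target.with_name(target.name + ".part")
        # Stream to a temporary file first so an interrupted run
        # never leaves a truncated cdx-*.gz behind.
        with urllib.request.urlopen(url) as resp, open(partial, "wb") as out:
            shutil.copyfileobj(resp, out)
        partial.replace(target)


if __name__ == "__main__":
    download_all()
```

Each file is large, so a run can take a long time; because files that already exist are skipped, the script can simply be restarted after an interruption.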


1 Answer: