Question

For a statistics project, I need to download a few files at random from the Google Patents page. Each file is a large zip file. The link is:

http://www.google.com/googlebooks/uspto-patents-grants-text.html#2012

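Specifically, I want to pick 5 years at random (the links at the top of the page) and download from them (i.e. 5 files). Do you know of any good packages for this task?

Thanks.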

Answer 1

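The page consists mainly of zip files, and from a look at the HTML it should be fairly easy to work out which links lead to a zip file: just search the set of candidate URLs for *.zip. So here is what I would suggest: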

fetch the page
parse the HTML
extract the anchor tags
for each anchor tag
    if href of anchor tag ends with ".zip"
        add href to list of file links

while more files needed
    generate a random index i, such that 0 <= i < number of links in list
    select i-th element from the links list
    fetch the zip file
    save the file to disk or load it in memory

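If you don't want to get the same file twice, just remove that URL from the list of links and pick another random index (until you have enough files or you run out of links). I don't know what programming language your team codes in, but writing a small program that does the above shouldn't be very difficult.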
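
In Python, for instance, the steps above could be sketched roughly as follows, using only the standard library. The names ZipLinkParser, extract_zip_links, pick_random and download_random_zips are made up for this illustration, and the sketch assumes the zip links appear as plain href attributes on anchor tags. random.sample is used in place of the manual "remove and pick again" loop, since it never returns the same element twice.

```python
import random
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose target ends in .zip."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_zip_links(html, base_url):
    """Return the unique zip URLs found in the page, in document order."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def pick_random(links, count):
    """Pick up to `count` distinct links at random."""
    return random.sample(links, min(count, len(links)))


def download_random_zips(page_url, count=5):
    """Fetch the page, choose `count` random zip links, save each to disk."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in pick_random(extract_zip_links(html, page_url), count):
        filename = url.rsplit("/", 1)[-1]  # assumed: last path segment is a usable file name
        urllib.request.urlretrieve(url, filename)
```

Calling download_random_zips with the page URL would then save 5 randomly chosen archives into the current directory. Since each file is large, you may want to add error handling and resume logic, which this sketch leaves out.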
