Question

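So far I have looked at a lot of tools, such as Scrapy and Selenium. Basically, the problem is not how to scrape a website, but how to scrape millions of them in a reasonable amount of time while respecting robots.txt and internet courtesy.

I have collected more than a billion URLs so far, and now I need to scrape each of them to get the "title" and the "meta tags".

Is this possible? How? Which tool would let me scrape that many URLs without getting blocked or banned by the sites?

Thanks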

Answer 1

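So, here is a complete solution. Using the requests and BeautifulSoup libraries will be your best option.

First, I am assuming the billion URLs are in a list. Your goal is to get the title and meta contents from these websites.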

import requests
from bs4 import BeautifulSoup

urls = ['http://github.com', 'http://bitbucket.com']  # ... up to 1 billion URLs :o

# Loop through the URLs one at a time
for url in urls:
    try:
        # A timeout keeps one unresponsive host from stalling the whole loop
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Skipping %s: %s" % (url, exc))
        continue
    soup = BeautifulSoup(response.text, 'html5lib')
    meta_content = soup.find_all('meta', content=True)  # <meta> tags that have a content attribute
    title_content = soup.find_all('title')  # <title> tags
    print("Meta for %s: %s" % (url, meta_content))
    print("Title for %s: %s" % (url, title_content))
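A plain loop like this fetches one page at a time, which is unlikely to get through anything close to a billion URLs. As a rough sketch only (not part of the original answer), the same work could be spread over a thread pool. The names `crawl_concurrently`, `fetch`, and `max_workers` are made up for illustration, and the function is meant to be fed one batch of URLs at a time rather than the whole collection, since it creates one future per URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_concurrently(urls, fetch, max_workers=32):
    """Call fetch(url) for each URL in a batch using worker threads.

    Yields (url, result, error) tuples in completion order.
    error is None when fetch(url) returned normally.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result(), None
            except Exception as exc:
                # One failing site should not stop the rest of the batch
                yield url, None, exc
```

Here `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, with the BeautifulSoup parsing done on each result as it comes back. The right number of workers depends on your bandwidth and on how many different hosts are in the batch, so treat 32 as a placeholder.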

Note: html.parser does not parse <meta> tags properly. It does not realize that they are self-closing, so I used the html5lib library instead.
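The question also asks about respecting robots.txt and not getting banned, which the code in this answer does not handle. Below is a minimal sketch of one way to approach it with the standard library's `urllib.robotparser`, assuming you have already downloaded each site's robots.txt as text. `USER_AGENT`, `parse_robots`, and `HostThrottle` are hypothetical names, and the one-second default delay is an arbitrary assumption, not a recommended value.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent string


def parse_robots(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class HostThrottle:
    """Track, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.next_allowed = {}

    def wait_time(self, url, parser=None, now=None):
        """Return how many seconds to wait before fetching url.

        Also reserves the slot, so the next call for the same host
        is pushed back by the crawl delay.
        """
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc
        delay = self.default_delay
        if parser is not None:
            # Use the site's Crawl-delay if robots.txt declares one
            delay = parser.crawl_delay(USER_AGENT) or delay
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + delay
        return ready_at - now
```

Before each request you would check `parser.can_fetch(USER_AGENT, url)`, skip the URL if it returns False, and otherwise call `time.sleep(throttle.wait_time(url, parser))`. Shuffling the URL list so that consecutive requests go to different hosts keeps those waits short.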

Scraping title and meta tags from millions of URLs
