Question

I still don't fully understand how to use BeautifulSoup. I can use it to parse the raw HTML of a web page, here "example_website.com":

from bs4 import BeautifulSoup  # load the BeautifulSoup class
import requests

r = requests.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
# soup.find_all('a') grabs all <a> elements (hyperlinks)

Then, to retrieve and print the 'href' attribute of every one of those elements, we can use a for loop:

for link in soup.find_all('a'):
    print(link.get('href'))

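What I don't understand: I have a website with several pages, and each page lists several hyperlinks to individual pages that contain tabular data.

I can use BeautifulSoup to parse the home page, but how do I use the same Python script to scrape page 2, page 3, and so on? How do you "access" the content reached through the 'href' links?

Is there a way to write a Python script to do this? Should I be using a spider?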

Answer 1

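You can do this with requests + BeautifulSoup. It will be blocking in nature: you process the extracted links one by one, and you don't move on to the next link until the current one is finished. Example implementation: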

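from urllib.parse import urljoin  # Python 3; in Python 2 this lived in the urlparse module

from bs4 import BeautifulSoup
import requests

with requests.Session() as session:
    r = session.get("http://example_website.com")
    data = r.text
    soup = BeautifulSoup(data, "html.parser")

    base_url = "http://example_website.com"
    # href=True skips <a> tags that have no href attribute
    for link in soup.find_all('a', href=True):
        # resolve relative links against the base URL
        url = urljoin(base_url, link.get('href'))

        r = session.get(url)
        # parse the subpage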

That said, it can quickly become complex and slow.
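
To give a feel for where the complexity comes from, here is a minimal sketch of the bookkeeping a hand-rolled crawler needs: a queue of pages to visit, a set of URLs already seen, a same-site check, and a page limit. The function names extract_table_rows and crawl, the fetch parameter, and the max_pages default are all assumptions made for illustration, not something from the question or from any library. It also assumes the tabular data sits in ordinary <tr>/<td> markup.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

from bs4 import BeautifulSoup


def extract_table_rows(html):
    """Return every table row on the page as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of same-site links, collecting table rows per URL.

    `fetch` is any callable that maps a URL to its HTML text (hypothetical
    parameter, so the sketch can run with or without a real network).
    """
    site = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    tables = {}

    while queue and len(tables) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        tables[url] = extract_table_rows(html)

        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=True):
            # Resolve relative hrefs and drop any "#fragment" part.
            target, _ = urldefrag(urljoin(url, link["href"]))
            # Stay on the same site and never queue a URL twice.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return tables
```

With requests, fetch could be something like lambda url: session.get(url).text inside a requests.Session() block. Every page is still fetched one after another, so the blocking behavior described above remains.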

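You may want to switch to the Scrapy web-scraping framework, which makes scraping, crawling, and following links easy (look at CrawlSpider with link extractors) and is fast and non-blocking by nature (it is built on Twisted).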
