Question

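I have a question about scraping the site "https://www.zaubacorp.com/company-list": I can't get the email ID from the links given in the table. I need to scrape the name, email, and directors from the links in that table. Can anyone help me solve this? I'm new to web scraping with Python, Beautiful Soup, and requests.

Thanks, Diksha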

# Scraping the website
# Import a library to query a website
import requests
# Import BeautifulSoup to parse the HTML
from bs4 import BeautifulSoup

# Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
page_html = requests.get(companies_list).text
soup = BeautifulSoup(page_html, 'lxml')
# Print the href of every link in the first table on the page
all_links = soup.table.find_all('a')
for link in all_links:
    print(link.get("href"))

Answer 1

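OK, let's break the site down and see what we can do.

First, I can see that the site is paginated. That means we need to work out how a page is requested: it might be something as simple as part of the URL of a GET request, or it might be an AJAX call that fills the table with new data when you click Next. Clicking through to the next page and the ones after it shows that we are in luck: the site puts the page number in the URL of an ordinary GET request.

The URL we'll request for scraping is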

https://www.zaubacorp.com/company-list/p-<page_num>-company.html

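We'll write some code that fills in that page number with values from 1 up to the last page you want to scrape. In this case we don't need to do anything special to find the last page of the table, because we can jump straight to it and see that it is page 13,333. So fully collecting the site's data means making 13,333 page requests.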
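As a rough sketch of that idea, the page numbers can be turned into URLs with a small generator. The names page_urls and crawl, and the pause between requests, are my own assumptions rather than anything the site documents; the pause is only there because 13,333 back-to-back requests is a lot to throw at one server.

```python
import time

# URL pattern taken from the answer; {page} is the page number.
BASE = "https://www.zaubacorp.com/company-list/p-{page}-company.html"

def page_urls(first_page, last_page):
    """Yield the listing URL for every page from first_page to last_page, inclusive."""
    for page in range(first_page, last_page + 1):
        yield BASE.format(page=page)

def crawl(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing after each call (assumed politeness delay)."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)
    return results
```

Here fetch is whatever function downloads and parses one page, so the same loop works for two pages or for all of them.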

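To collect the data from the site, we need to find the table that holds the information and then iterate over its elements to pull the information out.

In this case we can actually "cheat" a little, because there appears to be only one tbody on the page. We want to iterate over all of its rows (the <tr> elements) and extract the text. I'll go ahead and write the example.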

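import requests
import bs4

def get_url(page_num):
    # Build the listing URL for the given page number.
    return "https://www.zaubacorp.com/company-list/p-" + str(page_num) + "-company.html"

def scrape_row(tr):
    # Text of every <td> cell in the row; a row with no <td> cells
    # (such as a header row made of <th>) gives an empty list.
    return [td.text for td in tr.find_all("td")]

def scrape_table(table):
    # One list of cell texts per <tr> found under the given element.
    table_data = []
    for tr in table.find_all("tr"):
        table_data.append(scrape_row(tr))
    return table_data

def scrape_page(page_num):
    req = requests.get(get_url(page_num))
    soup = bs4.BeautifulSoup(req.content, "lxml")
    # Only one table body appears on the page, so searching the whole document is enough.
    data = scrape_table(soup)
    for line in data:
        print(line)

# Scrape pages 1 and 2 (the end of range() is exclusive).
for i in range(1, 3):
    scrape_page(i)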

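This code scrapes the first two pages of the site; change the range in the for loop to get all 13,333 pages. From here, you should be able to change the print logic to save the results as CSV.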
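One possible way to do the CSV step, as a minimal sketch: it assumes the rows are lists of strings like the ones scrape_table returns, and the helper names clean_rows and write_rows are made up for illustration. Empty rows (for example, a header row with no td cells) are dropped, and surrounding whitespace is stripped from each cell.

```python
import csv
import io

def clean_rows(rows):
    """Strip whitespace from each cell and drop rows that have no non-empty cell."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned

def write_rows(rows, fileobj):
    """Write the cleaned rows to an open text file object as CSV; return how many were written."""
    cleaned = clean_rows(rows)
    csv.writer(fileobj).writerows(cleaned)
    return len(cleaned)

if __name__ == "__main__":
    # Made-up sample standing in for the output of scrape_table().
    sample = [[], ["  Example Company Pvt Ltd ", "Active"]]
    buffer = io.StringIO()
    write_rows(sample, buffer)
    print(buffer.getvalue())
```

For a real run you would pass a file opened with open("companies.csv", "a", newline="", encoding="utf-8") instead of the in-memory buffer, and call write_rows once per page so that results are saved as you go.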
