Scraping tables from multiple URLs

Date: 2020-04-02 23:15:23

Tags: python url beautifulsoup

Hi, I've been able to scrape a table and export it from one particular website, but I'd like to add more websites to scrape. It only returns the table from the second URL I entered. Apologies in advance, as I'm not very familiar with Python. Thanks.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

for url in urls:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Upgrade-Insecure-Requests': '1',
        'DNT': '1',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
    }
    page = requests.get(url,headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')[3]
    df = pd.read_html(str(table))[0]

print(df)

1 Answer:

Answer 0 (score: 0):

Well, the issue here is that you are looping over the tables without appending them anywhere, and then you are printing outside the loop, so only the last table survives.

Example:

for item in range(1, 4):
    pass

print(item)

The output is now:

3

because that is the last element the loop produced.

But if we add something like the following:

result = []
for item in range(1, 4):
    result.append(item)

print(result)

then we get the following:

[1, 2, 3]
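Applied to the question's loop, a minimal sketch of that same fix (reusing the question's URLs and table index) would be:

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}

dfs = []
for url in urls:
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')[3]        # same table index as the question
    dfs.append(pd.read_html(str(table))[0])  # append instead of overwrite

print(pd.concat(dfs))  # both tables, not just the last one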

Now, moving on to the next point: you can already read the table directly with pandas.read_html, since urllib3 is already working in the background of pandas, like this:

import pandas as pd

df = pd.read_html(
    "http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650")[3]

print(df)
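Note that the [3] index matches the question's soup.find_all('table')[3]: pandas.read_html returns a list of every table it finds on the page, and we keep the fourth one.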

However, since the website's TCP layer is configured with Connection: close (ref):

HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response. For example,

   Connection: close
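As a quick sanity check (a sketch, assuming the server still responds this way), you can inspect the response headers with requests:

import requests

url = 'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650'
r = requests.get(url)
print(r.headers.get('Connection'))  # 'close' would confirm the behavior described above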

So instead we will run the script with the requests library, using a requests.Session() object that persists across requests so we are not blocked by the server's firewall for each url; we append each table to a list, concatenate them into one DataFrame using pd.concat, and then export to CSV using to_csv():

import pandas as pd
import requests

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []
    with requests.Session() as req:          # one session, reused across requests
        for url in urls:
            r = req.get(url, headers=headers)
            df = pd.read_html(r.content)[3]  # the fourth table on the page
            goal.append(df)                  # collect instead of overwrite
    goal = pd.concat(goal)                   # stack all tables row-wise
    goal.to_csv("data.csv", index=False)


main(urls)

Output: [screenshot of the resulting data.csv in the original answer]
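To sanity-check the export, a quick sketch that reads data.csv back:

import pandas as pd

print(pd.read_csv("data.csv").head())  # preview the first rows of the combined tables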

Code updated per the user's request:

import pandas as pd
import requests

urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            tables = pd.read_html(r.content)[2:4]  # the third and fourth tables
            goal.extend(tables)                    # collect both tables per url
    goal = pd.concat(goal)
    goal.to_csv("data.csv", index=False)


main(urls)
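Here the slice [2:4] keeps the third and fourth tables from each page, so goal collects two tables per url before pd.concat stacks them into a single DataFrame for the CSV.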

Output: [CSV preview linked in the original answer]