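Hello, I have been able to scrape a table from a specific website and export it, but I would like to add more sites to scrape. It only returns the second URL I entered. Apologies in advance, as I am not very familiar with Python. Thank you.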
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650', 'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']
for url in urls:
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
    page = requests.get(url,headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')[3]
    df = pd.read_html(str(table))[0]
print(df)
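Answer 0 (score: 0)
The problem is that you loop over the URLs and overwrite df on every iteration without storing it anywhere. You then print only after the loop has finished, so you see just the last table.
Example: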
for item in range(1, 4):
    pass
print(item)
The output is:
3
because item still holds the last value produced by the loop.
But if we append each value to a list instead:
result = []
for item in range(1, 4):
    result.append(item)
print(result)
we get:
[1, 2, 3]
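The same pattern carries over to the scraping loop in the question. Here is a minimal sketch, assuming a hypothetical fake_fetch function in place of the real request-and-parse step and placeholder example.com URLs, so that it runs without network access:

```python
def fake_fetch(url):
    """Hypothetical stand-in for requests.get + pd.read_html; returns a label."""
    return f"table from {url}"


urls = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs

# Overwriting: only the value from the last iteration survives the loop.
for url in urls:
    last = fake_fetch(url)

# Accumulating: every iteration's value is kept in the list.
collected = []
for url in urls:
    collected.append(fake_fetch(url))

print(last)
print(len(collected))
```

After the first loop, last refers only to the result for the second URL, which matches the symptom described in the question. The list collected holds one entry per URL.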
Now for the next point: you can read the table directly with pandas.read_html, because pandas fetches the URL itself in the background (through Python's urllib), like this:
import pandas as pd
df = pd.read_html(
    "http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650")[3]
print(df)
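However, the website is configured to send the HTTP header Connection: close. From the HTTP/1.1 specification (RFC 2616, section 14.10):
HTTP/1.1 defines the "close" connection option for the sender to signal that the connection will be closed after completion of the response. For example,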
Connection: close
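To check whether a given response carries this option, you can inspect its headers. Below is a rough sketch with a hypothetical helper, wants_close, which is not part of requests or pandas. It accepts any mapping of header names to values, so passing r.headers from a requests response should also work:

```python
def wants_close(headers):
    """Return True if the Connection header lists the "close" option."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    # The header value is a comma-separated list of connection options.
    raw = lowered.get("connection", "")
    options = [opt.strip().lower() for opt in raw.split(",")]
    return "close" in options


print(wants_close({"Connection": "close"}))
print(wants_close({"connection": "keep-alive"}))
```

The first call reports that the sender intends to close the connection after the response; the second does not.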
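So we will fetch the pages with the requests library instead, using a single requests.Session() object for all URLs so that the server's firewall does not block us. We append the table from each url to a list, combine the list into one table with the pd.concat function, and then write it to CSV with DataFrame.to_csv():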
import pandas as pd
import requests
urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []  # collects one DataFrame per URL
    # Reuse a single session for every request.
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            # read_html returns a list of tables; index 3 is the one we want.
            df = pd.read_html(r.content)[3]
            goal.append(df)
    # Stack the collected tables into one DataFrame and save it.
    goal = pd.concat(goal)
    goal.to_csv("data.csv", index=False)


main(urls)
Code updated at the user's request:
import pandas as pd
import requests
urls = ['http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=7&defaultdisplay=y&passjobnumber=123821098&passdocnumber=01&allbin=1015650',
        'http://a810-bisweb.nyc.gov/bisweb/BScanItemsRequiredServlet?requestid=6&defaultdisplay=y&passjobnumber=121054170&passdocnumber=01&allbin=1015650']
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(urls):
    goal = []  # collects every wanted DataFrame from every URL
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            # The slice keeps the tables at index 2 and 3 (a list of two).
            df = pd.read_html(r.content)[2:4]
            for table in df:
                goal.append(table)
    # Stack the collected tables into one DataFrame and save it.
    goal = pd.concat(goal)
    goal.to_csv("data.csv", index=False)


main(urls)
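The two versions differ only in how the list returned by read_html is indexed: [3] picks one table, while [2:4] returns a list of tables, which is why the updated code needs an inner loop. A small sketch of that difference, using placeholder strings instead of real DataFrames:

```python
# Placeholder strings stand in for the DataFrames that pd.read_html returns.
tables = ["table0", "table1", "table2", "table3", "table4"]

single = tables[3]     # one element: "table3"
several = tables[2:4]  # a list: ["table2", "table3"] (index 4 is excluded)

goal = []
goal.append(single)    # a single table is appended directly
for table in several:  # a slice must be unpacked element by element
    goal.append(table)

print(goal)
```

Calling goal.extend(several) would be an equivalent way to write the inner loop.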