我需要使用 Python 中的 BeautifulSoup 库通过网页抓取从网站上抓取一张表格。来自网址 https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html
当我运行这段代码时,我得到一个空表:
import requests
from bs4 import BeautifulSoup
#
vaacineProgressResponse = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
vaacineProgressContent = BeautifulSoup(vaacineProgressResponse.content, 'html.parser')
vaacineProgressContentTable = vaacineProgressContent.find_all('table', class_="g-summary-table svelte-2wimac")
if vaacineProgressContentTable is not None and len(vaacineProgressContentTable) > 0:
vaacineProgressContentTable = vaacineProgressContentTable[0]
#
print ('the table =', vaacineProgressContentTable)
输出:
the table = []
Process finished with exit code 0
下面的屏幕截图显示了网页中的表格(左侧)和相关的检查元素部分(右侧):
答案 0 :(得分:3)
很简单 - 这是因为您要搜索的班级中有一个额外的空间。
如果您将类更改为 g-summary-table svelte-2wimac
,标签应该会正确返回。
以下代码应该可以工作:
import requests
from bs4 import BeautifulSoup
#
url = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
soup = BeautifulSoup(url.content, 'html.parser')
table = soup.find_all('table', class_="g-summary-table svelte-2wimac")
print(table)
我也在 NYTimes 交互式网站上进行了类似的抓取,空间可能非常棘手。如果您添加了额外的空格或遗漏了一个空格,则返回空结果。
如果您找不到标签,我建议您先使用 print(soup.prettify())
打印整个文档,然后找到您计划抓取的所需标签。确保从 BeautifulSoup 打印的内容中复制类名的准确文本。
答案 1 :(得分:1)
或者,如果你想下载json格式的数据,然后读入pandas,你可以这样做。与上面相同的起始代码并处理汤对象
有几个可用的 api(以下是三个),但从 html 中提取出来,例如:
import re
import pandas as pd
latest_dataset = soup.find(string=re.compile('latest')).splitlines()[2].split('"')[1]
requests.get(latest_dataset).json()
latest_timeseries = soup.find(string=re.compile('timeseries')).splitlines()[2].split('"')[3]
requests.get(latest_timeseries).json()
allwithrate = soup.find(string=re.compile('all_with_rate')).splitlines()[2].split('"')[1]
requests.get(allwithrate).json()
pd.DataFrame(requests.get(allwithrate).json())
最后一个输出
geoid location last_updated total_vaccinations people_vaccinated display_name ... Region IncomeGroup country gdp_per_cap vaccinations_rate people_fully_vaccinated
0 MUS Mauritius 2021-02-17 3843.0 3843.0 Mauritius ... Sub-Saharan Africa High income Mauritius 11099.24028 0.3037 NaN
1 DZA Algeria 2021-02-19 75000.0 NaN Algeria ... Middle East & North Africa Lower middle income Algeria 3973.964072 0.1776 NaN
2 LAO Laos 2021-03-17 40732.0 40732.0 Laos ... East Asia & Pacific Lower middle income Lao PDR 2534.89828 0.5768 NaN
3 MOZ Mozambique 2021-03-23 57305.0 57305.0 Mozambique ... Sub-Saharan Africa Low income Mozambique 503.5707727 0.1943 NaN
4 CPV Cape Verde 2021-03-24 2184.0 2184.0 Cape Verde ... Sub-Saharan Africa Lower middle income Cabo Verde 3603.781793 0.4016 NaN
.. ... ... ... ... ... ... ... ... ... ... ... ... ...
243 GUF NaN NaN NaN NaN French Guiana ... NaN NaN NaN NaN NaN NaN
244 KOS NaN NaN NaN NaN Kosovo ... NaN NaN NaN NaN NaN NaN
245 CUW NaN NaN NaN NaN Cura�ao ... Latin America & Caribbean High income Curacao 19689.13982 NaN NaN
246 CHI NaN NaN NaN NaN Channel Islands ... Europe & Central Asia High income Channel Islands 74462.64675 NaN NaN
247 SXM NaN NaN NaN NaN Sint Maarten ... Latin America & Caribbean High income Sint Maarten (Dutch part) 29160.10381 NaN NaN
[248 rows x 17 columns]