Web scraping a table in Python returns an empty table

Time: 2021-04-18 01:49:58

Tags: python

I need to scrape a table from a website using the BeautifulSoup library in Python. The URL is https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html

When I run this code, I get an empty table:

import requests
from bs4 import BeautifulSoup
#
vaacineProgressResponse = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
vaacineProgressContent = BeautifulSoup(vaacineProgressResponse.content, 'html.parser')
vaacineProgressContentTable = vaacineProgressContent.find_all('table', class_="g-summary-table  svelte-2wimac")
if vaacineProgressContentTable is not None and len(vaacineProgressContentTable) > 0:
    vaacineProgressContentTable = vaacineProgressContentTable[0]
#
print ('the table =', vaacineProgressContentTable)

Output:

the table = []

Process finished with exit code 0

The screenshot below shows the table on the web page (left) and the corresponding Inspect Element panel (right):

[Image: screenshot of the rendered table and its markup in the element inspector]

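2 Answers:

Answer 0 (score: 3)

It's simple: the class string you are searching for contains an extra space.

If you change the class to g-summary-table svelte-2wimac (a single space between the two names), the tag should be returned correctly.

The following code should work: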

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the returned HTML.
response = requests.get("https://www.nytimes.com/interactive/2021/world/covid-vaccinations-tracker.html")
soup = BeautifulSoup(response.content, 'html.parser')
# Note the single space between the two class names.
table = soup.find_all('table', class_="g-summary-table svelte-2wimac")
print(table)

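I have done similar scraping on NYTimes interactive pages, and whitespace can be very tricky. If you add an extra space or leave one out, an empty result is returned.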
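To make the whitespace behaviour concrete, here is a minimal sketch that runs offline. SAMPLE_HTML is a made-up stand-in for the page markup (only the two class names come from the question), so it illustrates how BeautifulSoup compares class strings rather than what the live page returns. It also shows two lookups that should be less sensitive to spacing: matching on a single class name, and a CSS selector via select().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; only the class names come from the question.
SAMPLE_HTML = """
<table class="g-summary-table svelte-2wimac">
  <tr><td>World</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A class_ string containing several names is compared against the whole
# class attribute, so a doubled space does not match.
double_space = soup.find_all("table", class_="g-summary-table  svelte-2wimac")
single_space = soup.find_all("table", class_="g-summary-table svelte-2wimac")

# Matching on one class name ignores any other classes on the tag.
one_class = soup.find_all("table", class_="g-summary-table")

# A CSS selector requires both classes but does not depend on their order or spacing.
css_match = soup.select("table.g-summary-table.svelte-2wimac")

print(len(double_space), len(single_space), len(one_class), len(css_match))
```

On the sample markup only the double-space lookup comes back empty. Generated suffixes such as svelte-2wimac may change when a site is rebuilt, so matching on the more descriptive class alone is probably the less brittle choice.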

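If you can't find the tag, I recommend first printing the whole document with print(soup.prettify()) and then locating the tag you plan to scrape. Make sure you copy the exact text of the class name from what BeautifulSoup prints.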
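When the prettified document is too long to read comfortably, a narrower check is to list the class string of every table that BeautifulSoup actually parsed. The sketch below does that on a made-up SAMPLE_HTML; the helper name table_class_strings and the g-table class are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three tables with two, one, and no class names.
SAMPLE_HTML = """
<div>
  <table class="g-summary-table svelte-2wimac"><tr><td>a</td></tr></table>
  <table class="g-table"><tr><td>b</td></tr></table>
  <table><tr><td>c</td></tr></table>
</div>
"""

def table_class_strings(html):
    """Return the space-joined class names of every <table> in the document."""
    soup = BeautifulSoup(html, "html.parser")
    # tag.get("class") is a list of class names, or None when the attribute is absent.
    return [" ".join(tag.get("class") or []) for tag in soup.find_all("table")]

classes = table_class_strings(SAMPLE_HTML)
for value in classes:
    print(repr(value))
```

Passing response.text from the real request instead of SAMPLE_HTML would show the class strings available to copy. If the list comes back empty, the table is probably not present in the HTML that requests received (for example because it is built in the browser), and no class string will match it.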

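Answer 1 (score: 1)

Alternatively, if you want to download the data as JSON and then read it into pandas, you can do the following. It uses the same starting code as above and works on the soup object.

Several APIs are available (three are shown below); their URLs have to be extracted from the HTML, for example: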

import re
import pandas as pd

latest_dataset = soup.find(string=re.compile('latest')).splitlines()[2].split('"')[1]
requests.get(latest_dataset).json()

latest_timeseries = soup.find(string=re.compile('timeseries')).splitlines()[2].split('"')[3]
requests.get(latest_timeseries).json()

allwithrate = soup.find(string=re.compile('all_with_rate')).splitlines()[2].split('"')[1]
requests.get(allwithrate).json()
pd.DataFrame(requests.get(allwithrate).json())

The last one outputs:

    geoid    location last_updated  total_vaccinations  people_vaccinated     display_name  ...                      Region          IncomeGroup                    country  gdp_per_cap  vaccinations_rate people_fully_vaccinated
0     MUS   Mauritius   2021-02-17              3843.0             3843.0        Mauritius  ...          Sub-Saharan Africa          High income                  Mauritius  11099.24028             0.3037                     NaN
1     DZA     Algeria   2021-02-19             75000.0                NaN          Algeria  ...  Middle East & North Africa  Lower middle income                    Algeria  3973.964072             0.1776                     NaN
2     LAO        Laos   2021-03-17             40732.0            40732.0             Laos  ...         East Asia & Pacific  Lower middle income                    Lao PDR   2534.89828             0.5768                     NaN
3     MOZ  Mozambique   2021-03-23             57305.0            57305.0       Mozambique  ...          Sub-Saharan Africa           Low income                 Mozambique  503.5707727             0.1943                     NaN
4     CPV  Cape Verde   2021-03-24              2184.0             2184.0       Cape Verde  ...          Sub-Saharan Africa  Lower middle income                 Cabo Verde  3603.781793             0.4016                     NaN
..    ...         ...          ...                 ...                ...              ...  ...                         ...                  ...                        ...          ...                ...                     ...
243   GUF         NaN          NaN                 NaN                NaN    French Guiana  ...                         NaN                  NaN                        NaN          NaN                NaN                     NaN
244   KOS         NaN          NaN                 NaN                NaN           Kosovo  ...                         NaN                  NaN                        NaN          NaN                NaN                     NaN
245   CUW         NaN          NaN                 NaN                NaN          Curaçao  ...   Latin America & Caribbean          High income                    Curacao  19689.13982                NaN                     NaN
246   CHI         NaN          NaN                 NaN                NaN  Channel Islands  ...       Europe & Central Asia          High income            Channel Islands  74462.64675                NaN                     NaN
247   SXM         NaN          NaN                 NaN                NaN     Sint Maarten  ...   Latin America & Caribbean          High income  Sint Maarten (Dutch part)  29160.10381                NaN                     NaN

[248 rows x 17 columns]
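The splitlines()[2].split('"')[1] indexing in this answer depends on the exact line layout of the inline script, so it may break if the page is reformatted. A minimal sketch of a less position-dependent alternative follows, assuming the data URLs appear in the script as double-quoted strings ending in .json. That is an assumption about the page, and SAMPLE_SCRIPT with its example.com URLs is made up; find_json_urls is a hypothetical helper name.

```python
import re

# Hypothetical stand-in for the text of an inline <script>; the URLs are made up.
SAMPLE_SCRIPT = '''
var config = {
  latest: "https://example.com/data/latest.json",
  timeseries: "https://example.com/data/timeseries.json",
  all_with_rate: "https://example.com/data/all_with_rate.json"
};
'''

# A double-quoted http(s) URL that ends in ".json".
JSON_URL = re.compile(r'"(https?://[^"\s]+\.json)"')

def find_json_urls(script_text):
    """Return every quoted .json URL in the text, in order of appearance."""
    return JSON_URL.findall(script_text)

urls = find_json_urls(SAMPLE_SCRIPT)

# Index the URLs by file name without the ".json" suffix.
by_name = {u.rsplit("/", 1)[-1][:-len(".json")]: u for u in urls}

for name, url in by_name.items():
    print(name, url)
```

With the real page, the script text could be taken from soup.find(string=re.compile('all_with_rate')) as in the answer, and by_name['all_with_rate'] would then be the URL to pass to requests.get(...).json().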