I am trying to build a web scraper to create a COVID-19 dataset for my data visualization project. I need the table from https://www.worldometers.info/coronavirus/.
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)
soup = BeautifulSoup(page.content,features="lxml")
rows = soup.select("tr")
for data in rows:
    print(data.text)
I get the expected output, but each row (country) also includes the continent name, which I do not want in the dataset. Is there a way to fix this? I am new to web scraping, so I need all the help I can get.
Update: here is the HTML. The last td, the one marked "Europe", is not needed in the dataset.
<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
Answer 0 (score: 0)
Try the code below. The key BeautifulSoup functions are find and findAll; read the full documentation and examples, and you should manage to collect what you want.
Edit: the continent rows carry a 'data-continent' attribute, so the trick is to loop over the rows that do not have it. Note that the 'World' row also lacks the attribute, so I ignore it 'manually'. Here is the modified code:
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)
soup = BeautifulSoup(page.content,features="lxml")
# find the table with id: 'main_table_countries_today'
table = soup.find('table', {'id': 'main_table_countries_today'})
body = table.find('tbody')
# looping through all rows, without 'data-continent' attribute :
# looping through all rows without the 'data-continent' attribute:
for row in body.findAll('tr', {'data-continent': None}):
    print('\nParsing a new line:')
    values = row.findAll('td')
    # looping through all cells inside the row, ignoring the 'World' one:
    if values[0].text != 'World':
        for val in values:
            print(val.text)
The result is:
Parsing a new line:
Parsing a new line:
USA
1,322,223
+438
78,622
+7
223,749
1,019,852
16,978
3,995
238
8,638,846
26,099
North America
Parsing a new line:
Spain
262,783
+2,666
26,478
+179
173,157
63,148
1,741
5,620
566
1,932,455
41,332
Europe
Parsing a new line:
Italy
217,185
30,201
99,023
87,961
1,168
3,592
500
2,445,063
40,440
Europe
[...]
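Note that each country row still ends with the continent name, because the hidden td inside the row carries the same data-continent attribute (see the HTML in the question). If that cell should be dropped as well, the same attribute filter can be applied at the cell level; an untested sketch building on the code above:

for row in body.findAll('tr', {'data-continent': None}):
    # also skip the hidden continent cell inside each country row
    values = row.findAll('td', {'data-continent': None})
    if values and values[0].text != 'World':
        print([val.text.strip() for val in values])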
Answer 1 (score: 0)
Your code grabs every tr tag, regardless of where it sits in the page. You need to target the specific table: the data you want is in the first table body.
import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tbody = soup.find('tbody')  # selecting the first tbody
rows = tbody.find_all('tr')
for row in rows:
    print(row.text)
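Since the goal is a dataset, the same rows can also be collected into lists of cell values and written out with the standard csv module. A minimal sketch, assuming the soup and rows from above; the file name covid19.csv is arbitrary, and has_attr (a standard BeautifulSoup Tag method) is used to skip the hidden continent cell:

import csv

with open('covid19.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # collect cell texts, skipping the hidden continent <td>
        cells = [td.get_text(strip=True)
                 for td in row.find_all('td')
                 if not td.has_attr('data-continent')]
        if cells:  # skip rows that contain no <td> cells
            writer.writerow(cells)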
Hope this helps.
Answer 2 (score: 0)
Another solution:
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
'''
doc = SimplifiedDoc(html)
rows = doc.selects('tr').selects('td')
for data in rows:
    print(data.notContains('display:none', attr="style").text)
Result:
['UK', '211,364', '', '31,241', '', 'N/A', '179,779', '1,559', '3,114', '460', '1,631,561', '24,034']
There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples