I am trying to build a web scraper to create a COVID-19 dataset for my data visualization project. I need the table from https://www.worldometers.info/coronavirus/.
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)
soup = BeautifulSoup(page.content,features="lxml")
rows = soup.select("tr")
for data in rows:
    print(data.text)
I get the expected output, but each row (country) also includes the continent name, which I do not want in the dataset. Is there a way to fix this? I am new to web scraping, so I need all the help I can get.
Update: here is the HTML. The last td, the one marked "Europe", is not needed in the dataset.
<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
Answer 0 (score: 0)
Try the code below. The key BeautifulSoup functions are find and findAll; read the full documentation and examples, and you should manage to collect what you want.
Edit: the continent rows carry a 'data-continent' attribute, so the trick is to loop over the rows that do not have it. Note that the 'World' row also lacks the attribute, so I ignore it 'manually'. Here is the modified code:
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/"
page = requests.get(url,verify=True)
soup = BeautifulSoup(page.content,features="lxml")
# find the table with id: 'main_table_countries_today'
table = soup.find('table', {'id': 'main_table_countries_today'})
body = table.find('tbody')
# looping through all rows, without 'data-continent' attribute :
# looping through all rows without the 'data-continent' attribute:
for row in body.findAll('tr', {'data-continent': None}):
    print('\nParsing a new line:')
    values = row.findAll('td')
    # looping through all cells inside the row, ignoring the 'World' one:
    if values[0].text != 'World':
        for val in values:
            print(val.text)
The result is:
Parsing a new line:
Parsing a new line:
USA
1,322,223
+438
78,622
+7
223,749
1,019,852
16,978
3,995
238
8,638,846
26,099
North America
Parsing a new line:
Spain
262,783
+2,666
26,478
+179
173,157
63,148
1,741
5,620
566
1,932,455
41,332
Europe
Parsing a new line:
Italy
217,185
30,201
99,023
87,961
1,168
3,592
500
2,445,063
40,440
Europe
[...]
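Note that each country row still ends with the continent name, because the hidden td inside the row carries the same data-continent attribute (see the HTML in the question). If that cell should be dropped as well, the same attribute filter can be applied at the cell level; an untested sketch building on the code above:

for row in body.findAll('tr', {'data-continent': None}):
    # also skip the hidden continent cell inside each country row
    values = row.findAll('td', {'data-continent': None})
    if values and values[0].text != 'World':
        print([val.text.strip() for val in values])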
Answer 1 (score: 0)
Your code grabs every tr tag, regardless of where it sits in the page. You need to target the specific table: the data you want is in the first table body.
import requests
from bs4 import BeautifulSoup

url = "https://www.worldometers.info/coronavirus/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tbody = soup.find('tbody')  # selecting the first tbody
rows = tbody.find_all('tr')
for row in rows:
    print(row.text)
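Since the goal is a dataset, the same rows can also be collected into lists of cell values and written out with the standard csv module. A minimal sketch, assuming the soup and rows from above; the file name covid19.csv is arbitrary, and has_attr (a standard BeautifulSoup Tag method) is used to skip the hidden continent cell:

import csv

with open('covid19.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        # collect cell texts, skipping the hidden continent <td>
        cells = [td.get_text(strip=True)
                 for td in row.find_all('td')
                 if not td.has_attr('data-continent')]
        if cells:  # skip rows that contain no <td> cells
            writer.writerow(cells)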
Hope this helps.
Answer 2 (score: 0)
Another solution:
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<tr style="" role="row" class="odd">
<td style="font-weight: bold; font-size:15px; text-align:left;"><a class="mt_a" href="country/uk/">UK</a></td>
<td style="font-weight: bold; text-align:right" class="sorting_1">211,364</td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right;">31,241 </td>
<td style="font-weight: bold; text-align:right;"></td>
<td style="font-weight: bold; text-align:right">N/A</td>
<td style="text-align:right;font-weight:bold;">179,779</td>
<td style="font-weight: bold; text-align:right">1,559</td>
<td style="font-weight: bold; text-align:right">3,114</td>
<td style="font-weight: bold; text-align:right">460</td>
<td style="font-weight: bold; text-align:right">1,631,561</td>
<td style="font-weight: bold; text-align:right">24,034</td>
<td style="display:none" data-continent="Europe">Europe</td>
</tr>
'''
doc = SimplifiedDoc(html)
rows = doc.selects('tr').selects('td')
for data in rows:
    print(data.notContains('display:none', attr="style").text)
Result:
['UK', '211,364', '', '31,241', '', 'N/A', '179,779', '1,559', '3,114', '460', '1,631,561', '24,034']
There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples