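Scraping table data from a Wikipedia page

Time: 2020-10-13 20:34:22

Tags: python html

I'm learning how to use the BeautifulSoup library in Python, and for practice I'm trying to scrape the genre titles from a Wikipedia page: https://en.wikipedia.org/wiki/List_of_jazz_genres

This is how far I've gotten with my code: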

from bs4 import BeautifulSoup

html = open("wiki-jazz.html", encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")

table = soup.find_all("table")[1]
td = table.find_all("td")
print(td)

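The table at index 1 contains the data I want to access. More specifically, I only need the value of the title attribute, as in this fragment of the output: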

</td>, <td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a>

I've been racking my brain over how to extract this information. I've looked at other posts here, but I can't quite get there. Thanks.

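2 Answers:

Answer 0: (score: 0)

To print the first column of the table, you can iterate over the rows (<tr>) and then get all the cells (<td>) of each row. The first cell of each row is the jazz genre: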

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_jazz_genres'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

table = soup.find_all("table")[1]

for row in table.find_all('tr')[1:]:    # <-- [1:] because we don't want the header
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells[0])

Prints:

Acid jazz
Afro-Cuban jazz
Avant-garde jazz
Bebop
Bossa nova
British dance band
Cape jazz
Chamber jazz
Continental jazz
Cool jazz
Crossover jazz
Dark jazz/Doomjazz[1][2][3]
Dixieland
Electro Swing
Ethio jazz
Ethno jazz
European free jazz
Free funk
Free jazz
Frevo
Gypsy jazz
Hard bop
Hot club
Indo jazz
Jazz blues
Jazz-funk
Jazz fusion
Jazz rap
Jazz rock
Kansas City blues
Kansas City jazz
Latin jazz
M-Base
Mainstream jazz
Modal jazz
Neo-bop jazz
Neo-swing
Neo-bop jazz
Novelty ragtime
Nu jazz
Orchestral jazz
Post-bop
Punk jazz
Ragtime
Ska jazz
Smooth jazz
Soul jazz
Straight-ahead jazz
Stride jazz
Swing
Third stream
Trad jazz
Vocal jazz
West Coast jazz
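
The loop above prints the text of each first cell, whereas the question asked for the title attribute of the link inside it. The following is a minimal sketch of that variant, not a tested scraper for the live page: SAMPLE_HTML and first_column_titles are hypothetical names, and the sample rows are invented markup modeled on the <td><a ... title="..."> fragment shown in the question, so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the fragment quoted in the question.
SAMPLE_HTML = """
<table>
  <tr><th>Genre</th><th>Description</th></tr>
  <tr><td><a href="/wiki/Bebop" title="Bebop">Bebop</a></td><td>Fast tempo</td></tr>
  <tr><td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a></td><td>Calmer style</td></tr>
  <tr><td>No link here</td><td>Plain cell</td></tr>
</table>
"""


def first_column_titles(html):
    """Return the title attribute of the first titled link in each row's first cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    titles = []
    for row in table.find_all("tr"):
        first_cell = row.find("td")  # None for a header row that only has <th>
        if first_cell is None:
            continue
        # title=True matches only <a> tags that actually carry a title attribute
        link = first_cell.find("a", title=True)
        if link is not None:
            titles.append(link["title"])
    return titles


titles = first_column_titles(SAMPLE_HTML)
print(titles)
```

On the real page you would pass the downloaded HTML and select the table with soup.find_all("table")[1] as in the answer. Reading the title attribute instead of the cell text may also avoid footnote markers such as [1][2][3], but that depends on how the page marks up those cells.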

Answer 1: (score: 0)

You should read the BeautifulSoup documentation on how to get the attributes of a tag, such as href and src.

You can use this here:

item[1].get('title')
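
The answer does not say what item refers to, so here is a small self-contained sketch of attribute access with Tag.get. The markup and the variable names (html, links, first_title, missing_title) are assumptions made for illustration; only the first link mirrors the fragment from the question.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first link mirrors the question, the second has no title.
html = """
<table><tr>
  <td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a></td>
  <td><a href="#note">no title attribute</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .get() returns the attribute value, or None when the attribute is absent
first_title = links[0].get("title")
missing_title = links[1].get("title")
href = links[0].get("href")

print(first_title)
print(missing_title)
print(href)
```

Dictionary-style access such as links[1]["title"] would raise a KeyError for the second link, which is why .get() is the safer choice when some tags may lack the attribute.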