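I'm learning how to use the BeautifulSoup library in Python, and for practice I'm trying to scrape the category titles from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_jazz_genres
This is as far as I've gotten with my code: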
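from bs4 import BeautifulSoup

# Open a locally saved copy of the Wikipedia page
html = open("wiki-jazz.html", encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")
# The second table on the page holds the genre list
table = soup.find_all("table")[1]
td = table.find_all("td")
print(td)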
table[1] contains the data I want to access. More specifically, I really only need the value of the title attribute, as in:
</td>, <td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a>
I've been racking my brain over how to extract this information. I've looked at other posts here but couldn't quite get there. Thanks.
Answer 0: (score: 0)
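To print the first column of the table, you can iterate over the rows (<tr>) and then get all the cells (<td>) of each row. The first cell of each row is your jazz genre: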
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_jazz_genres'
soup = BeautifulSoup(requests.get(url).content, "html.parser")
table = soup.find_all("table")[1]

for row in table.find_all('tr')[1:]:  # <-- [1:] because we don't want the header
    # Collect the text of every cell in this row
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells[0])
Prints:
Acid jazz
Afro-Cuban jazz
Avant-garde jazz
Bebop
Bossa nova
British dance band
Cape jazz
Chamber jazz
Continental jazz
Cool jazz
Crossover jazz
Dark jazz/Doomjazz[1][2][3]
Dixieland
Electro Swing
Ethio jazz
Ethno jazz
European free jazz
Free funk
Free jazz
Frevo
Gypsy jazz
Hard bop
Hot club
Indo jazz
Jazz blues
Jazz-funk
Jazz fusion
Jazz rap
Jazz rock
Kansas City blues
Kansas City jazz
Latin jazz
M-Base
Mainstream jazz
Modal jazz
Neo-bop jazz
Neo-swing
Neo-bop jazz
Novelty ragtime
Nu jazz
Orchestral jazz
Post-bop
Punk jazz
Ragtime
Ska jazz
Smooth jazz
Soul jazz
Straight-ahead jazz
Stride jazz
Swing
Third stream
Trad jazz
Vocal jazz
West Coast jazz
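One detail in the output above: the entry "Dark jazz/Doomjazz[1][2][3]" still carries footnote markers, because get_text() also returns the text of the citation superscripts. Below is a minimal sketch of one way to drop them. It assumes the markers are <sup> elements, and it uses a small made-up HTML sample (the hrefs and rows are illustrative, not copied from the live page) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking one footnoted row and one plain row.
sample_html = """
<table>
  <tr><th>Genre</th></tr>
  <tr><td><a href="/wiki/Dark_jazz" title="Dark jazz">Dark jazz</a>/Doomjazz<sup class="reference">[1]</sup><sup class="reference">[2]</sup></td></tr>
  <tr><td><a href="/wiki/Bebop" title="Bebop">Bebop</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# Remove every <sup> element (assumed to be a footnote marker) from the tree.
for sup in table.find_all("sup"):
    sup.decompose()

# Same idea as the answer: first cell of each non-header row.
genres = []
for row in table.find_all("tr")[1:]:
    first_cell = row.find("td")
    if first_cell is not None:
        genres.append(first_cell.get_text(strip=True))

print(genres)  # ['Dark jazz/Doomjazz', 'Bebop']
```

Note that decompose() modifies the parsed tree in place, so do this only if you don't need the footnotes later.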
Answer 1: (score: 0)
You should read the BeautifulSoup documentation on how to get the attributes of a tag, such as href, src, and so on.
Here you can use:
item[1].get('title')
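The answer does not define `item`, so here is a minimal sketch of how .get('title') might be applied to the table from the question. It assumes the genre link is the first <a> in the first cell of each row, and it uses a small made-up HTML sample (modelled on the snippet in the question) instead of the live page. The variable names are illustrative only:

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the markup shown in the question.
sample_html = """
<table>
  <tr><th>Genre</th><th>Description</th></tr>
  <tr><td><a href="/wiki/Acid_jazz" title="Acid jazz">Acid jazz</a></td><td>sample text</td></tr>
  <tr><td><a href="/wiki/West_Coast_jazz" title="West Coast jazz">West Coast jazz</a></td><td>sample text</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

titles = []
for row in table.find_all("tr")[1:]:  # skip the header row
    first_cell = row.find("td")
    if first_cell is None:
        continue
    link = first_cell.find("a")
    # .get() returns None when the attribute is missing, instead of raising KeyError
    if link is not None and link.get("title"):
        titles.append(link.get("title"))

print(titles)  # ['Acid jazz', 'West Coast jazz']
```

Using link.get("title") rather than link["title"] keeps the loop from failing on a link that has no title attribute.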