I want to extract the artist and song title from a page.
Page: http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23
<div class="detail-body">
<h4 class="detail-heading" itemprop="name">No son of mine</h4>
<span itemprop="byArtist" itemscope="" itemtype="http://schema.org/MusicGroup"><link href="http://www.swr3.de/musik/poplexikon/-/id=927882/did=70326/i3zglz/index.html" itemprop="url">
<h5 itemprop="name">Genesis</h5>
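This block is repeated several times on the page (see the swr3.de link at the top), but I don't know how to build a list from it with BeautifulSoup in Python, like this:
Genesis - No son of mine
Double You - Please don't go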
Answer 0 (score: 0)
Use a combination of BeautifulSoup, requests, and lxml.
First, install the prerequisites:
pip install beautifulsoup4
pip install requests
pip install lxml
swr3.py:
# Written for Python 3.
import requests
from bs4 import BeautifulSoup

parsedsongs = []
result = requests.get('http://www.swr3.de//-/id=47424/cf=42/did=65794/93avs/index.html?hour=5&date=2015-10-23')
# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(result.content, "lxml")
# Each song sits in its own <div class="detail-body">.
detailbodys = soup.find_all('div', 'detail-body')
for detailbody in detailbodys:
    title = detailbody.h4.string.strip()
    # The artist is in an <h5> when present; otherwise fall back to the <span> text.
    if detailbody.h5:
        artist = detailbody.h5.string.strip()
    else:
        artist = detailbody.span.string.strip()
    parsedsongs.append({'artist': artist, 'title': title})

for entry in parsedsongs:
    print('Artist: {}\tTitle: {}'.format(entry['artist'], entry['title']))
Output:
(swr3)macbook:swr3 joeyoung$ python swr3.py
Artist: Vaya Con Dios Title: Nah neh nah
Artist: Genesis Title: No son of mine
Artist: Genesis Title: No son of mine
Artist: Double You Title: Please don't go
Artist: Stereo MC's Title: Step it up
Artist: Cranberries Title: Zombie
Artist: La Bouche Title: Sweet dreams
Artist: Die Prinzen Title: Du mußt ein Schwein sein
Artist: Bad Religion Title: Punk rock song
Artist: Bellini Title: Samba de Janeiro
Artist: Dion, Celine; Bee Gees Title: Immortality
Artist: Jones, Tom; Mousse T. Title: Sex bomb
Artist: Yanai, Kate Title: Bacardi feeling (Summer dreamin')
Artist: Heroes Del Silencio Title: Entre dos tierras
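
The question asked for a list in the form "Artist - Title", and the output above lists the Genesis entry twice. A minimal sketch of one way to get there from parsedsongs is shown below. The helper name format_songs is hypothetical, the sample list is hard-coded from the output above so the snippet runs without a network request, and dropping repeated entries is an assumption about what is wanted (a song may legitimately appear twice in a playlist).

```python
# Hypothetical sample shaped like the parsedsongs list built by swr3.py;
# the values are copied from the output shown above.
parsedsongs = [
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Genesis', 'title': 'No son of mine'},
    {'artist': 'Double You', 'title': "Please don't go"},
]


def format_songs(songs):
    """Return 'Artist - Title' strings, skipping repeated entries."""
    seen = set()
    lines = []
    for song in songs:
        key = (song['artist'], song['title'])
        if key in seen:
            continue  # this artist/title pair was already emitted
        seen.add(key)
        lines.append('{} - {}'.format(song['artist'], song['title']))
    return lines


songlist = format_songs(parsedsongs)
print('\n'.join(songlist))
```

Tracking seen pairs in a set while appending to a list keeps the original page order, which a plain set() over the entries would not guarantee. If repeats should be kept, the seen check can simply be removed.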