I am trying to scrape a page using BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup
url='http://www.xpn.org/playlists/xpn-playlist'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for link in soup.find_all("li", class_="song"):
    print link
The problem is that the text I want returned is not enclosed in its own HTML tag:
<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()
" onmouseout="delayhidemenu()" onmouseover="dropdownmenu(this, event, menu1,
'100px','Death Vessel','Mandan Dink','Stay Close')">Buy</a>
Chuck Ragan - Rotterdam - Folkadelphia Session</li>
What I want returned is:
Chuck Ragan - Rotterdam - Folkadelphia Session
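Bonus points: the returned data is in the format artist / song / album. What is the proper data structure for storing and manipulating this information?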
Answer 0 (score: 1)
Try something like:
for link in soup.find_all("li", class_="song"):
    print link.text
Output:
Buy Chuck Ragan - Rotterdam - Folkadelphia Session
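Of course, if you want to remove Buy, you can use a slice like this (the offset depends on the exact whitespace around Buy in the page):
for link in soup.find_all("li", class_="song"):
    print link.text.strip()[5:]
The output is: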
Chuck Ragan - Rotterdam - Folkadelphia Session
If you want to save these strings in a list:
[i.strip() for i in link.text.strip()[5:].split('-')]
Output:
['Chuck Ragan', 'Rotterdam', 'Folkadelphia Session']
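Splitting on a bare '-' also cuts any name that contains a hyphen. As a minimal sketch (Python 3 syntax; `split_title` and the second sample string are hypothetical, not from the page), one could split on the spaced separator ' - ' and at most twice:

```python
def split_title(title):
    """Split 'artist - song - album' into three stripped fields."""
    # Split on the spaced separator so a hyphen inside a name is kept,
    # and at most twice so any extra separator stays in the album field.
    parts = [p.strip() for p in title.split(" - ", 2)]
    if len(parts) != 3:
        raise ValueError("expected 'artist - song - album', got %r" % title)
    return parts


print(split_title("Chuck Ragan - Rotterdam - Folkadelphia Session"))
# Hypothetical title with a hyphenated artist name.
print(split_title("Some-Artist - A Song - An Album"))
```

This still assumes the page always uses ' - ' between fields; a title that itself contains ' - ' in the artist or song field would be split in the wrong place.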
For more details, see the documentation.
Answer 1 (score: 1)
Here is another way (assuming the li has 3 children; if not, change [2] to [1]):
>>> html = '''<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()
... " onmouseout="delayhidemenu()" onmouseover="dropdownmenu(this, event, menu1,
... '100px','Death Vessel','Mandan Dink','Stay Close')">Buy</a>
... Chuck Ragan - Rotterdam - Folkadelphia Session</li>'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(html)
>>> all_li = soup.findAll('li', class_='song')
>>> for li in all_li:
...     text = list(li.children)[2]
...     artist, song, album = text.strip().split(' - ')
...     print artist, song, album
...
Chuck Ragan Rotterdam Folkadelphia Session
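Both the child index and a fixed slice offset depend on the exact markup. A minimal sketch that avoids both (Python 3 syntax; it assumes the wanted text is always the last non-empty string inside each li, and the sample markup is the question's snippet with the link attributes shortened) could use `stripped_strings`:

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the <a> attributes shortened.
html = '''<li class="song"> <a href="/default.htm">Buy</a>
Chuck Ragan - Rotterdam - Folkadelphia Session</li>'''

soup = BeautifulSoup(html, "html.parser")

titles = []
for li in soup.find_all("li", class_="song"):
    # stripped_strings yields every string under the tag with surrounding
    # whitespace removed, skipping whitespace-only strings.
    strings = list(li.stripped_strings)
    if strings:  # guard against an empty <li>
        # Assumption: the title is the last string, after the "Buy" link.
        titles.append(strings[-1])

print(titles)
```

If the real page ever places another element after the title, the last string would no longer be the title, so this is worth checking against the live markup.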
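Answer 2 (score: 0)
You can use something like this.
for l in soup.find_all("li", class_="song"):
    parts = l.text.split("-")
    # Drop the leading "Buy" link text, then trim whitespace
    artist = parts[0].replace("Buy", "", 1).strip()
    song = parts[1].strip()
    album = parts[2].strip()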
Answer 3 (score: 0)
**Use a named tuple for storage**
from bs4 import BeautifulSoup
import urllib2
from collections import namedtuple
url='http://www.xpn.org/playlists/xpn-playlist'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
songs=[]
Song = namedtuple("Song", "artist name album")
for link in soup.find_all("li", class_="song"):
    song = Song._make(link.text.strip()[12:].split(" - "))
    songs.append(song)
for song in songs:
    print song.artist, song.name, song.album
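To illustrate why a named tuple suits the artist/song/album records, here is a small sketch in Python 3 syntax. The rows other than the first are hypothetical placeholders rather than scraped data, and grouping by artist is just one example of manipulating the records:

```python
from collections import defaultdict, namedtuple

Song = namedtuple("Song", "artist name album")

# Stand-ins for titles scraped from the page; only the first is from the question.
rows = [
    "Chuck Ragan - Rotterdam - Folkadelphia Session",
    "Example Artist - First Song - Example Album",
    "Example Artist - Second Song - Example Album",
]

songs = []
for row in rows:
    fields = [f.strip() for f in row.split(" - ", 2)]
    if len(fields) == 3:  # skip rows that do not have exactly three fields
        songs.append(Song(*fields))

# Fields are accessed by name, which keeps later code readable.
by_artist = defaultdict(list)
for song in songs:
    by_artist[song.artist].append(song.name)

print(dict(by_artist))
```

Named tuples are immutable and compare like plain tuples, so a list of them can be sorted or deduplicated directly; if records need to be edited after creation, a dict or a small class may fit better.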