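I'm new to Python. What I'm trying to do is use Python and Beautiful Soup to extract all the bands announced so far for this year's Glastonbury Festival. I want to dump all the bands to a text file and eventually create a Spotify playlist based on each artist's top tracks.
The list of artists I want to extract is at www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml# (I actually want the ones on the A-Z tab rather than the Friday tab).
I first tried printing the bands to the terminal, but the output was blank. Here is what I tried: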
from bs4 import BeautifulSoup
import urllib2
#efestivals page with all glastonbury acts
url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
bands = soup.findAll('a')
for eachband in bands:
    print eachband.string
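Basically, I need help getting into the A-Z tab and extracting each band. I also only want the confirmed bands (those with img src="/img2009/lineup_confirmed.gif"). I'm not very familiar with HTML, but this seemed like a reasonable starting point.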
Answer 0 (score: 1)
There are many ways to solve this. Here is just one example that seems to work:
from bs4 import BeautifulSoup
import urllib2 as ul
url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page = ul.urlopen(url)
soup = BeautifulSoup(page.read())
elements = soup.findAll('img', {'src': '/img2009/lineup_confirmed.gif'})
bands = [e.next_element.next_element.text for e in elements]
print bands[1:11]
Output:
[u'Arctic Monkeys', u'Dizzee Rascal', u'The Vaccines', u'Kenny Rogers']
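The two `next_element` hops above depend on exactly what sits between the icon and the band link. As a rough sketch of that behaviour, here is the same lookup run on a small inline fragment instead of the live page. The fragment is an assumption about the markup, not a copy of it: the `lineup_rumoured.gif` icon, the band hrefs and "Some Band" are invented for illustration. It also shows `find_next("a")`, which may be a bit more tolerant of markup changes, though it would pick up an unrelated later link if a row had none.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the confirmed-icon path comes from the question.
html = (
    '<table>'
    '<tr><td><img src="/img2009/lineup_confirmed.gif" alt="confirmed"> '
    '<a href="/festivals/bands/a/arctic-monkeys">Arctic Monkeys</a></td></tr>'
    '<tr><td><img src="/img2009/lineup_rumoured.gif" alt="rumoured"> '
    '<a href="/festivals/bands/s/some-band">Some Band</a></td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, "html.parser")
icons = soup.find_all("img", {"src": "/img2009/lineup_confirmed.gif"})

# Two hops: the icon's next element is the space after it, then the <a> tag.
by_hops = [icon.next_element.next_element.get_text() for icon in icons]

# find_next("a") skips whatever nodes sit between the icon and the link.
by_search = [icon.find_next("a").get_text() for icon in icons]

print(by_hops)
print(by_search)
```

If the whitespace between the icon and the link were missing, the two-hop version would land on the link's text node rather than the `<a>` tag, so the search-based version is probably the safer default.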
Answer 1 (score: 1)
To extract the links for the confirmed bands from the A-Z table:
#!/usr/bin/env python
import re
try:
    from urllib2 import urlopen
except ImportError:  # Python 3
    from urllib.request import urlopen

from bs4 import BeautifulSoup, NavigableString


def table_after_atoz(tag):
    '''Whether tag is a <table> after an element with id="LUA to Z".'''
    if tag.name == 'table' and 'TableLineupBox' in tag.get('class', ''):
        for tag in tag.previous_elements:  # go back
            if not isinstance(tag, NavigableString):  # skip strings
                return tag.get('id') == "LUA to Z"


def confirmed_band_links(soup):
    table = soup.find(table_after_atoz)  # find A to Z table
    for tr in table.find_all('tr'):  # find all rows (including nested tables)
        if tr.find('img', alt="confirmed"):  # row with a confirmed band?
            yield tr.find('a', href=re.compile(r'^/festivals/bands'))  # a link


def main():
    url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml"
    soup = BeautifulSoup(urlopen(url))
    for link in confirmed_band_links(soup):
        print("%s\t%s" % (link['href'], link.string))


main()
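The key trick in this answer is passing a function to `soup.find`, which Beautiful Soup calls on each tag in document order until one returns true. Below is a minimal offline sketch of that idea on a made-up fragment. Only the id "LUA to Z" and the class "TableLineupBox" are taken from the code above; the "LUFriday" id, the cell text and the name `follows_atoz_marker` are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the page: two tab markers, each followed by a table.
html = (
    '<div id="LUFriday"></div>'
    '<table class="TableLineupBox"><tr><td>Friday acts</td></tr></table>'
    '<div id="LUA to Z"></div>'
    '<table class="TableLineupBox"><tr><td>A to Z acts</td></tr></table>'
)


def follows_atoz_marker(tag):
    """True for a lineup <table> whose nearest preceding tag has id="LUA to Z"."""
    if tag.name != "table" or "TableLineupBox" not in tag.get("class", []):
        return False
    for previous in tag.previous_elements:  # walk backwards through the document
        if not isinstance(previous, NavigableString):  # ignore text nodes
            return previous.get("id") == "LUA to Z"
    return False


soup = BeautifulSoup(html, "html.parser")
table = soup.find(follows_atoz_marker)  # first tag for which the function is true
cell_text = table.td.get_text()
print(cell_text)
```

Both tables share the same class, so the class alone cannot tell the tabs apart; checking the nearest preceding tag is what selects the A to Z one here.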
Answer 2 (score: 0)
The following works:
from bs4 import BeautifulSoup
import urllib2
#efestivals page with all glasto acts
url = "http://www.efestivals.co.uk/festivals/glastonbury/2013/lineup.shtml#"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
bands = soup.findAll('a', href=True)
for band in bands:
    if band['href'].startswith("/festivals/bands"):
        print band.string
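Since the question asks for the bands to end up in a text file rather than on the terminal, here is one possible sketch of that last step, written for Python 3. The helper name `save_bands` and the file name `bands.txt` are hypothetical, and the sample names are taken from the output shown in the first answer. It skips empty entries because `band.string` is `None` when a tag has more than one child.

```python
import io


def save_bands(names, path):
    """Write unique, non-empty band names to path, one per line.

    Names keep the order in which they were first seen.
    """
    seen = set()
    unique = []
    for name in names:
        name = (name or "").strip()  # tolerate None and stray whitespace
        if name and name not in seen:
            seen.add(name)
            unique.append(name)
    with io.open(path, "w", encoding="utf-8") as handle:
        for name in unique:
            handle.write(name + "\n")
    return unique


if __name__ == "__main__":
    # In the loop above this list would be built from band.string values.
    sample = ["Arctic Monkeys", None, "Dizzee Rascal", "Arctic Monkeys "]
    print(save_bands(sample, "bands.txt"))
```

Removing duplicates is a design choice, not something the page requires: an act listed under more than one letter or stage would otherwise appear twice in the file.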