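I've been practicing Python (I'm a beginner) and I'm stuck on this problem. I'm trying to scrape the list of songs from the Billboard Hot 100, but the result isn't what I need.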
Here is the code. As you can see, I store the songs in a dictionary and then print them.

from lxml import html
import requests

page = requests.get('http://www.billboard.com/charts/hot-100')
tree = html.fromstring(page.content)
billboard = {}
for x in range(1, 51):
    currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[4]/div[3]/div/h2/text()'
    currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[4]/div[3]/div/a/text()'
    currSongX = tree.xpath(currSongY)
    currArtistX = tree.xpath(currArtistY)
    if currArtistX == '[]' and currSongX == '[]':
        currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[3]/div[3]/div/h2/text()'
        currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[3]/div[3]/div/a/text()'
        currSongX = tree.xpath(currSongY)
        currArtistX = tree.xpath(currArtistY)
        if currArtistX == '[]' and currSongX == '[]':
            currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[2]/div[3]/div/h2/text()'
            currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[2]/div[3]/div/a/text()'
            currSongX = tree.xpath(currSongY)
            currArtistX = tree.xpath(currArtistY)
    currSong = str(currSongX)[2:(len(str(currSongX))-2)]
    #currArtist = str(currArtistX)[4:(len(str(currArtistX))-4)]
    currArtist = str(currArtistX).replace("\\n","")
    billboard[x] = (currSong, currArtist)
print(billboard)
Here is the result:
> {1: ('Despacito', "['Luis Fonsi & Daddy Yankee Featuring Justin Bieber']"), 2: ('', '[]'), 3: ('', '[]'), 4: ('', '[]'), 5: ('', '[]'), 6: ('', '[]'), 7: ('', '[]'), 8: ('', '[]'), 9: ('', '[]'), 10: ('', '[]'), 11: ('', '[]'), 12: ('', '[]'), 13: ('', '[]'), 14: ('', '[]'), 15: ('', '[]'), 16: ('', '[]'), 17: ('', '[]'), 18: ('', '[]'), 19: ('', '[]'), 20: ('', '[]'), 21: ('', '[]'), 22: ('', '[]'), 23: ('Bad Liar', "['Selena Gomez']"), 24: ('', '[]'), 25: ('', '[]'), 26: ('', '[]'), 27: ('', '[]'), 28: ('', '[]'), 29: ('', '[]'), 30: ('', '[]'), 31: ('', '[]'), 32: ('', '[]'), 33: ('', '[]'), 34: ('', '[]'), 35: ('', '[]'), 36: ('', '[]'), 37: ('Everyday We Lit', "['YFN Lucci Featuring PnB Rock']"), 38: ('', '[]'), 39: ('', '[]'), 40: ('', '[]'), 41: ('', '[]'), 42: ('', '[]'), 43: ('', '[]'), 44: ('', '[]'), 45: ('', '[]'), 46: ('', '[]'), 47: ('', '[]'), 48: ('', '[]'), 49: ('', '[]'), 50: ('', '[]')}
Please help!
Answer 0 (score: 0)
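It's better to let the parser do some of the work for you when navigating the HTML: have it build an element tree, then look for tags and attributes in that tree instead of hard-coding long positional paths.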
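As a small offline illustration of that idea, the sketch below runs on a made-up HTML fragment instead of the live page. The `SAMPLE` markup and the `extract_rows` helper are assumptions for illustration only; the `data-songtitle` attribute and the `chart-row__artist` class are the ones this answer relies on, and the real page may be structured differently.

```python
from lxml import etree

# Hypothetical markup that imitates the chart rows; the real page may differ.
SAMPLE = """
<html><body>
  <article data-songtitle="Song A"><a class="chart-row__artist"> Artist A </a></article>
  <article><a class="chart-row__artist">No title here</a></article>
  <article data-songtitle="Song B"><a class="other">x</a><a class="chart-row__artist">Artist B</a></article>
</body></html>
"""

def extract_rows(markup):
    """Return (song, artist) pairs for every <article> that has data-songtitle."""
    root = etree.HTML(markup)
    rows = []
    # Select articles by attribute rather than by position in the page.
    for article in root.xpath('//article[@data-songtitle]'):
        # Relative XPath: search for the artist link inside this article only.
        artists = article.xpath('.//a[@class="chart-row__artist"]/text()')
        artist = artists[0].strip() if artists else ''
        rows.append((article.get('data-songtitle').strip(), artist))
    return rows

print(extract_rows(SAMPLE))
```

Selecting by attribute means a row keeps matching even if a wrapper `div` is added or removed around it, which is where positional paths tend to break.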
The following code works for the Billboard Hot 100:
from lxml import etree
import requests

page = requests.get('http://www.billboard.com/charts/hot-100')
# etree.HTML parses the markup and returns the root element directly,
# so no round trip through tostring() and a second parser is needed.
root = etree.HTML(page.content)

billboard = []
# Each chart row is an <article> that carries a data-songtitle attribute.
for article in root.iter('article'):
    if 'data-songtitle' in article.attrib:
        currSong = article.attrib['data-songtitle']
        # The artist is the first <a> whose class is "chart-row__artist".
        for item in article.iter('a'):
            if item.get('class') == 'chart-row__artist':
                currArtist = item.text or ''
                billboard.append((currSong.strip(), currArtist.strip()))
                break

for entry in billboard:
    print(entry)
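As for why the original loop came back mostly empty, a likely cause is that `tree.xpath(...)` returns a list, so comparing it with the string `'[]'` is always false and the fallback paths never run. Here is a minimal sketch on a made-up fragment; the markup and the `first_match` helper are hypothetical and only meant to show the emptiness check.

```python
from lxml import html

# Hypothetical document: the title sits in the second <div>, not the first.
tree = html.fromstring(
    '<html><body><div id="main"><div></div><div><h2>Song A</h2></div></div></body></html>'
)

def first_match(tree, paths):
    """Return the first non-empty XPath result among the candidate paths."""
    for path in paths:
        result = tree.xpath(path)
        if result:  # an empty list is falsy; no string comparison needed
            return result
    return []

miss = tree.xpath('//*[@id="main"]/div[1]/h2/text()')
print(miss == '[]')  # a list never equals a string
print(not miss)      # truthiness is the reliable emptiness test

titles = first_match(tree, ['//*[@id="main"]/div[1]/h2/text()',
                            '//*[@id="main"]/div[2]/h2/text()'])
print(titles[0])
```

Indexing the result (`titles[0]`) also avoids slicing the string form of the list, which is what left the brackets and quotes in the artist names.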
Hope this helps.