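Billboard Hot 100 scraping problem

Date: 2017-06-25 23:21:14

Tags: python web-scraping

I've been practicing Python (I'm a beginner) and I'm stuck on this problem. I'm trying to scrape the list of songs from the Billboard Hot 100, but the result is not what I need.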

Here is the code. As you can see, I store the songs in a dictionary and then print them.

from lxml import html
import requests

page = requests.get('http://www.billboard.com/charts/hot-100')
tree = html.fromstring(page.content)
billboard = {}

for x in range(1, 51):

    currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[4]/div[3]/div/h2/text()'
    currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[4]/div[3]/div/a/text()'

    currSongX = tree.xpath(currSongY)
    currArtistX = tree.xpath(currArtistY)

    if currArtistX == '[]' and currSongX == '[]':
        currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[3]/div[3]/div/h2/text()'
        currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[3]/div[3]/div/a/text()'
        currSongX = tree.xpath(currSongY)
        currArtistX = tree.xpath(currArtistY)

        if currArtistX == '[]' and currSongX == '[]':
            currSongY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[2]/div[3]/div/h2/text()'
            currArtistY = '//*[@id="main"]/div[2]/div/div[1]/article[' + str(x) + ']/div[1]/div[2]/div[3]/div/a/text()'
            currSongX = tree.xpath(currSongY)
            currArtistX = tree.xpath(currArtistY)

    currSong = str(currSongX)[2:(len(str(currSongX))-2)]
    #currArtist = str(currArtistX)[4:(len(str(currArtistX))-4)]
    currArtist = str(currArtistX).replace("\\n","")
    billboard[x] = (currSong, currArtist)

print(billboard)

Here is the result:

> {1: ('Despacito', "['Luis Fonsi & Daddy Yankee Featuring Justin Bieber']"), 2: ('', '[]'), 3: ('', '[]'), 4: ('', '[]'), 5: ('', '[]'), 6: ('', '[]'), 7: ('', '[]'), 8: ('', '[]'), 9: ('', '[]'), 10: ('', '[]'), 11: ('', '[]'), 12: ('', '[]'), 13: ('', '[]'), 14: ('', '[]'), 15: ('', '[]'), 16: ('', '[]'), 17: ('', '[]'), 18: ('', '[]'), 19: ('', '[]'), 20: ('', '[]'), 21: ('', '[]'), 22: ('', '[]'), 23: ('Bad Liar', "['Selena Gomez']"), 24: ('', '[]'), 25: ('', '[]'), 26: ('', '[]'), 27: ('', '[]'), 28: ('', '[]'), 29: ('', '[]'), 30: ('', '[]'), 31: ('', '[]'), 32: ('', '[]'), 33: ('', '[]'), 34: ('', '[]'), 35: ('', '[]'), 36: ('', '[]'), 37: ('Everyday We Lit', "['YFN Lucci Featuring PnB Rock']"), 38: ('', '[]'), 39: ('', '[]'), 40: ('', '[]'), 41: ('', '[]'), 42: ('', '[]'), 43: ('', '[]'), 44: ('', '[]'), 45: ('', '[]'), 46: ('', '[]'), 47: ('', '[]'), 48: ('', '[]'), 49: ('', '[]'), 50: ('', '[]')}
>>> 

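Please help!!!!!

1 answer:

Answer 0 (score: 0)

It is better to let the parser do some of the work for you when navigating the HTML: build an element tree, then look up tags and attributes in that tree instead of hard-coding long positional XPath expressions.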
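As a minimal offline sketch of that idea, the snippet below walks a small element tree and picks values out by tag and attribute. It makes several assumptions: the markup in `SAMPLE` is hypothetical and only loosely modelled on the `data-songtitle` and `chart-row__artist` names used in this answer, the helper `extract_chart` is an illustrative name, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs without a network connection (lxml's element objects offer the same `iter`, `get` and `text` interface).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a few chart rows (not the real page).
SAMPLE = """
<div>
  <article data-songtitle=" Despacito ">
    <a class="chart-row__artist"> Luis Fonsi &amp; Daddy Yankee Featuring Justin Bieber </a>
  </article>
  <article>
    <a class="chart-row__artist">Row without a title attribute</a>
  </article>
  <article data-songtitle="Bad Liar">
    <a class="other">not the artist link</a>
    <a class="chart-row__artist">Selena Gomez</a>
  </article>
</div>
"""


def extract_chart(root):
    """Return (song, artist) tuples for every <article> that has a title."""
    chart = []
    for article in root.iter('article'):
        title = article.get('data-songtitle')
        if title is None:
            continue  # not a chart row, skip it
        for link in article.iter('a'):
            if link.get('class') == 'chart-row__artist':
                # link.text can be None for an empty element
                chart.append((title.strip(), (link.text or '').strip()))
                break  # only the first matching link per row
    return chart


chart = extract_chart(ET.fromstring(SAMPLE))
for position, (song, artist) in enumerate(chart, start=1):
    print(position, song, artist)
```

Because the lookup depends on attribute names rather than on the position of each `div`, a row whose layout differs slightly from its neighbours is still matched.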

The following code works for the Billboard Hot 100:

from lxml import etree
import requests

page = requests.get('http://www.billboard.com/charts/hot-100')
# etree.HTML parses the downloaded markup and returns the root element
root = etree.HTML(page.content)

billboard = []
# Only <article> elements with a data-songtitle attribute are treated as chart rows
for article in root.iter('article'):
    if 'data-songtitle' in article.attrib:
        currSong = article.attrib['data-songtitle']
        # The artist is taken from the first <a class="chart-row__artist"> in the row
        for item in article.iter('a'):
            if item.attrib.get('class') == 'chart-row__artist':
                currArtist = item.text or ''
                billboard.append((currSong.strip(), currArtist.strip()))
                break

for entry in billboard:
    print(entry)
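One likely reason the fallback branches in the question never run: `tree.xpath(...)` returns a list for these expressions, and a list never compares equal to the string `'[]'`, so `currArtistX == '[]'` is always false. The sketch below illustrates the difference. It is only an illustration under stated assumptions: the one-row markup is hypothetical, the helper names `is_empty_wrong` and `is_empty` are made up for this example, and `xml.etree.ElementTree.findall` is used in place of lxml's `xpath` so it runs with the standard library alone (both return a list of matches).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a single chart row.
row = ET.fromstring('<article><h2>Despacito</h2></article>')

titles = row.findall('./h2')   # list containing one element
missing = row.findall('./h3')  # empty list, nothing matched


def is_empty_wrong(result):
    # The comparison used in the question: a list against the string '[]'.
    return result == '[]'


def is_empty(result):
    # An empty list is falsy, so "not result" detects "no match".
    return not result


print(is_empty_wrong(missing))  # the fallback condition is never met
print(is_empty(missing))
print(is_empty(titles))
```

With a check such as `if not currArtistX and not currSongX:` the alternative XPath expressions in the question would at least be tried, although the attribute-based lookup is less fragile.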

Hope this helps.