Getting null results when web scraping with XPath

Time: 2015-12-25 01:34:44

Tags: python xpath web-scraping

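I am using the following two functions to scrape pages for a song's download link. The function get_song_details scrapes a link and finds the song title and album, and the function get_download_url scrapes another link and finds the link for the song title passed as a parameter.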

    import requests
    from lxml import html
    import time

    def get_song_details(link):
        page = requests.get(link)
        tree = html.fromstring(page.content)

        # retrieve song title from page
        song = tree.xpath('//font[@class="general"]/b[2]/text()')
        if song:
            song = song[0].strip()
        else:
            raise ValueError("Song Title: Webpage structure has changed.")
        song = song.split("-")[0] if song.find("-") else song

        # retrieve album name from link
        tokens = link.split("/")
        album = tokens[5] if len(tokens) > 6 else None

        song_details = {
            "title": song,
            "album": album,
        }
        return song_details

    def get_download_url(song_details):
        title = song_details["title"]
        album = song_details["album"]
        url = "http://www.songspk.site/indian/anjaana_anjaani_2010.html"
        print song_details, url
        page = requests.get(url)
        tree = html.fromstring(page.content)
        download_url = tree.xpath('//a[contains(text(), "{0}")]/@href'.format(title))
        return download_url

The following code works fine when executed:

    song_details = {
        "title": "Aas Paas Khuda",
        "album": "Anjaana Anjaani"
    }
    print get_download_url(song_details)

It prints:

    ['http://www.songspk.link/link1/song1.php?songid=7753', 'http://www.songspk.link/link1/song1.php?songid=7759']

However, when I execute the following snippet, I get an empty list, even though the song_details dictionary has the same content as the hardcoded snippet above:

    song_details = get_song_details("http://www.glamsham.com/music/lyrics/anjaana-anjaani/aas-pass-khuda/1368/3089.htm")
    print get_download_url(song_details)

I can't understand why this fails when the song_details argument has the same title as in the snippet above.

1 answer:

Answer 0: (score: 0)

It looks like there is a typo on one of the pages. Note that the title you scrape is Aas Pass Khuda (the same spelling appears in the glamsham URL, aas-pass-khuda), but on the Songs.PK page it is Aas Paas Khuda: Pass vs Paas. Because the XPath contains() test requires an exact substring match, the scraped title matches no link text and the result is an empty list.
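If the two sites cannot be relied on to spell titles identically, one possible workaround is to collect the link texts first and pick the closest one with the standard library's difflib, instead of depending on an exact contains() match. The following is only a minimal sketch in Python 3 syntax: find_closest_title is a hypothetical helper, the link_texts list is placeholder data standing in for the anchor texts you would extract from the Songs.PK page, and the 0.8 cutoff is an assumed value you would likely need to tune.

```python
import difflib


def find_closest_title(scraped_title, candidate_titles, cutoff=0.8):
    """Return the candidate most similar to scraped_title, or None.

    Hypothetical helper: compares case-insensitively and returns the
    candidate in its original casing.
    """
    # Map lowercased text back to the original so the caller gets
    # the exact string that appears on the page.
    lowered = {candidate.lower(): candidate for candidate in candidate_titles}
    matches = difflib.get_close_matches(
        scraped_title.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


# Placeholder data: in practice these would be the link texts
# extracted from the download page.
link_texts = ["Aas Paas Khuda", "Some Other Song", "Yet Another Track"]

best = find_closest_title("Aas Pass Khuda", link_texts)
print(best)
```

The value returned could then be passed to the existing contains() query in get_download_url. A fuzzy match can also pick the wrong song when titles are very similar, so it is probably worth logging both the scraped title and the chosen match.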