I searched SO but couldn't find anything relevant.
I'm using BeautifulSoup for scraping. This is the code I found on SO:
    for section in soup.findAll('div', attrs={'id': 'dmusic_tracklist_track_title_B00KHQOKGW'}):
        nextNode = section
        while True:
            # Walk the siblings that follow the matched div
            nextNode = nextNode.nextSibling
            try:
                tag_name = nextNode.name
            except AttributeError:
                # nextNode is None once there are no more siblings
                tag_name = ""
            if tag_name == "a":
                print(nextNode.text)  # .text is a property, not a method
            else:
                print("*****")
                break
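If I go to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and want to grab every song, how would I do that? The problem is that each song has a different ID associated with it, based on its product code. For example, here are the XPaths for the titles of the first two songs:

    //*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text()
    //*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text()

You'll notice the first ID ends in B00KHQOKGW, while the second ends in B00KHQOLWK. Is there a way to add a "wildcard" to the end of the id so that I grab every song, no matter what the product ID at the end is? For example, something like id="dmusic_tracklist_track_title_*", where I've replaced the product ID with a *.
Or can I use a div to target the title I want? I feel this would be the best option: it uses the class of the div right above the title, which doesn't contain any product ID: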
    for section in soup.findAll('div', attrs={'class': 'a-section a-spacing-none overflow_ellipsis'}):
        nextNode = section
        while True:
            # Walk the siblings that follow the matched div
            nextNode = nextNode.nextSibling
            try:
                tag_name = nextNode.name
            except AttributeError:
                # nextNode is None once there are no more siblings
                tag_name = ""
            if tag_name == "a":
                print(nextNode.text)  # .text is a property, not a method
            else:
                print("*****")
                break
Answer 0 (score: 1)
You can pass a function as the id attribute value and check that the value starts with dmusic_tracklist_track_title_:
    from bs4 import BeautifulSoup
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
    response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')

    # The lambda receives each tag's id value; `x and ...` skips tags without an id
    for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
        print(song.text.strip())
Prints:
Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]
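The `x and` guard in the filter function matters because, as far as I know, BeautifulSoup calls the function with None for tags that have no id attribute, and None has no startswith method. Below is a minimal sketch of the predicate on its own, applied to a plain list instead of a parsed page. The name is_track_title_id and the sample value dmusic_tracklist_header are made up for illustration; the two track IDs come from the question.

```python
PREFIX = 'dmusic_tracklist_track_title_'


def is_track_title_id(value):
    """Return True only for id values that start with the track-title prefix."""
    # bool(value) is False for None (no id attribute) and for an empty string,
    # so startswith() is only called on a non-empty string.
    return bool(value) and value.startswith(PREFIX)


# Sample id values; None stands in for a tag without an id attribute.
sample_ids = [
    'dmusic_tracklist_track_title_B00KHQOKGW',
    'dmusic_tracklist_track_title_B00KHQOLWK',
    'dmusic_tracklist_header',  # hypothetical non-matching id
    None,
]

matching = [value for value in sample_ids if is_track_title_id(value)]
print(matching)
```

With a parsed page, the same function could be passed in place of the lambda, as in soup.find_all(id=is_track_title_id).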
Alternatively, you can pass a regular expression pattern as the attribute value:
    import re
    from bs4 import BeautifulSoup
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
    response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')

    # Raw string keeps \w as a regex escape rather than a string escape
    for song in soup.find_all(id=re.compile(r'^dmusic_tracklist_track_title_\w+$')):
        print(song.text.strip())
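The pattern ^dmusic_tracklist_track_title_\w+$ matches dmusic_tracklist_track_title_ followed by one or more "word" characters (0-9, a-z, A-Z and _). In Python 3, \w also matches non-ASCII letters and digits unless the pattern is compiled with re.ASCII.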
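To see what the pattern accepts and rejects without fetching the page, here is a small sketch that uses only the re module. The first candidate is an ID from the question; the other three are hypothetical strings chosen to exercise the anchors and \w+. It uses search(), which I believe is how BeautifulSoup applies a compiled pattern to attribute values.

```python
import re

# Same pattern as above, written as a raw string.
TRACK_ID = re.compile(r'^dmusic_tracklist_track_title_\w+$')

candidates = [
    'dmusic_tracklist_track_title_B00KHQOKGW',    # real ID from the question
    'dmusic_tracklist_track_title_',              # prefix only: \w+ needs at least one character
    'dmusic_tracklist_track_title_B00-KHQ',       # hypothetical: '-' is not a word character
    'x_dmusic_tracklist_track_title_B00KHQOLWK',  # hypothetical: rejected by the ^ anchor
]

# Map each candidate to whether the pattern matches it.
results = {candidate: bool(TRACK_ID.search(candidate)) for candidate in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

Only the first candidate matches. The ^ and $ anchors are what make the pattern stricter than a plain prefix check like startswith.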