How do I scrape similar elements whose id attribute differs?

Time: 2014-09-23 02:07:44

Tags: python xpath web-scraping beautifulsoup

I searched SO but couldn't find anything relevant.

I'm using BeautifulSoup for scraping. This is the code I found on SO:

for section in soup.findAll('div', attrs={'id': 'dmusic_tracklist_track_title_B00KHQOKGW'}):
    nextNode = section
    while True:
        # Walk the siblings that follow the matched div
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            # No sibling left: None has no .name
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

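If I go to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and want to grab every song, how would I do that? The problem is that each song has a different ID associated with it, based on its product code. For example, here are the XPaths for the titles of the first two songs:

//*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text()
//*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text()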

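You'll notice that the first ID ends in B00KHQOKGW while the second ends in B00KHQOLWK. Is there a way to put a "wildcard" at the end of the id so that I grab every song regardless of the product ID at the end? For example, something like id="dmusic_tracklist_track_title_*", where I have replaced the product ID with *.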

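Or can I use a div to target the titles I want? I feel this would be the best approach: it uses the class of the div above the title, which doesn't contain any product ID: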

for section in soup.findAll('div', attrs={'class': 'a-section a-spacing-none overflow_ellipsis'}):
    nextNode = section
    while True:
        # Walk the siblings that follow the matched div
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            # No sibling left: None has no .name
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

1 Answer:

Answer 0: (score: 1)

You can pass a function as the id attribute value and check that the id starts with dmusic_tracklist_track_title_:

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, 'html.parser')
# The function receives each tag's id value (None when the tag has no id)
for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    print(song.text.strip())

Prints:

Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]
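To see why the lambda is written as `x and x.startswith(...)`, here is a minimal sketch of the same check as a named function, run on plain strings instead of a live page. As far as I understand the BeautifulSoup docs, the function gets None for tags that have no id, hence the guard. The name `is_track_title_id` and the sample values are made up for illustration.

```python
PREFIX = 'dmusic_tracklist_track_title_'


def is_track_title_id(value):
    """Return True when an id attribute value starts with PREFIX.

    The truthiness check guards against calling startswith() on None,
    which is what a tag without an id attribute would produce.
    """
    return bool(value) and value.startswith(PREFIX)


# Hypothetical id values, including a tag with no id at all (None)
sample_ids = [
    'dmusic_tracklist_track_title_B00KHQOKGW',
    'dmusic_tracklist_track_title_B00KHQOLWK',
    'dmusic_tracklist_header',
    None,
]

matching_ids = [value for value in sample_ids if is_track_title_id(value)]
print(matching_ids)
```

A function like this could be passed as `soup.find_all(id=is_track_title_id)` in place of the lambda, which keeps the filter reusable and easier to test on its own.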

Alternatively, you can pass a regular expression pattern as the attribute value:

import re
from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.content, 'html.parser')
# Raw string so that \w reaches the regex engine unchanged
for song in soup.find_all(id=re.compile(r'^dmusic_tracklist_track_title_\w+$')):
    print(song.text.strip())

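^dmusic_tracklist_track_title_\w+$ matches dmusic_tracklist_track_title_ followed by one or more "alphanumeric" characters (0-9, a-z, A-Z and _) up to the end of the id. On Python 3, \w also matches other Unicode word characters unless the re.ASCII flag is used.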
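If you want to check what the pattern accepts without hitting the site, a quick sketch with only the re module is below. BeautifulSoup applies a compiled pattern with its search() method, so the ^ and $ anchors are what stop a partial match. The candidate ids other than the first are invented for illustration.

```python
import re

# Same pattern as above, written as a raw string
TRACK_ID = re.compile(r'^dmusic_tracklist_track_title_\w+$')

# Hypothetical id values for illustration
candidates = [
    'dmusic_tracklist_track_title_B00KHQOKGW',    # prefix plus product code
    'dmusic_tracklist_track_title_',              # prefix only, \w+ needs one char
    'x_dmusic_tracklist_track_title_B00KHQOLWK',  # extra text before the prefix
    'dmusic_tracklist_track_title_B00-KHQ',       # '-' is not matched by \w
]

# search() mirrors how BeautifulSoup applies a compiled pattern
results = {candidate: bool(TRACK_ID.search(candidate)) for candidate in candidates}
for candidate, matched in results.items():
    print(matched, candidate)
```

Only the first candidate should come back as a match. Dropping the anchors would let the third one through as well, which is worth keeping in mind if the prefix could appear inside other ids.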