Question

I searched SO but couldn't find anything related.

I'm using BeautifulSoup for scraping. This is the code I found on SO:

for section in soup.findAll('div',attrs={'id':'dmusic_tracklist_track_title_B00KHQOKGW'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

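If I go to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and want to grab every song, how would I do that? The problem is that each song has a different ID associated with it, based on its product code. For example, here are the XPaths for the titles of the first two songs: //*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text() and //*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text().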

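You'll notice that the first ID ends in B00KHQOKGW while the second ends in B00KHQOLWK. Is there a way to put a "wildcard" at the end of the id so that I grab every song, whatever the trailing product ID is? For example, something like id="dmusic_tracklist_track_title_*", where I have replaced the product ID with *.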

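Or can I use a div to target the titles I want? (I feel this would be the best option. It uses the class of the div right above the title, which doesn't contain any product ID.):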

for section in soup.findAll('div',attrs={'class':'a-section a-spacing-none overflow_ellipsis'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

Answer 1

You can pass a function as the id attribute value and check whether the id starts with dmusic_tracklist_track_title_:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

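# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.content, 'html.parser')
# The function receives each tag's id value (None when the tag has no id).
for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    print(song.text.strip())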

Prints:

Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]
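
To try the function filter without requesting the live page, here is a minimal offline sketch. The HTML is a made-up stand-in for the track list: only the two track ids come from the question, while the surrounding markup, the extra divs, and the helper name is_track_title_id are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the track list; only the two track ids come from the question.
SAMPLE_HTML = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><div><a href="#">Hold On [Explicit]</a></div></div>
<div id="dmusic_tracklist_track_title_B00KHQOLWK"><div><a href="#">Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]</a></div></div>
<div id="unrelated_block"><a href="#">Not a song</a></div>
<div><a href="#">No id at all</a></div>
"""

PREFIX = 'dmusic_tracklist_track_title_'


def is_track_title_id(value):
    # BeautifulSoup passes the id value, or None when the tag has no id attribute.
    return value is not None and value.startswith(PREFIX)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
titles = [tag.get_text(strip=True) for tag in soup.find_all(id=is_track_title_id)]
print(titles)
```

The explicit None check plays the same role as the `x and ...` guard in the lambda: tags without an id would otherwise raise an AttributeError on startswith.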

Alternatively, you can pass a regular expression pattern as the attribute value:

import re
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

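# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(response.content, 'html.parser')
# Raw string, so the backslash in \w is passed to the regex engine unchanged.
for song in soup.find_all(id=re.compile(r'^dmusic_tracklist_track_title_\w+$')):
    print(song.text.strip())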

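^dmusic_tracklist_track_title_\w+$ matches dmusic_tracklist_track_title_ followed by one or more "alphanumeric" characters (0-9, a-z, A-Z and _). The ^ and $ anchors require the whole id to have this shape.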
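
A small offline sketch of what the pattern accepts and rejects. The markup is hypothetical: only the first id comes from the question, and the other three are invented to exercise the anchors and the \w+ part.

```python
import re
from bs4 import BeautifulSoup

# Raw string so that \w reaches the regex engine unchanged.
TRACK_ID = re.compile(r'^dmusic_tracklist_track_title_\w+$')

# Hypothetical markup; only the first id comes from the question.
SAMPLE_HTML = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><a>Hold On [Explicit]</a></div>
<div id="dmusic_tracklist_track_title_"><a>empty suffix</a></div>
<div id="dmusic_tracklist_track_title_B00-KHQ"><a>hyphen in suffix</a></div>
<div id="x_dmusic_tracklist_track_title_B00KHQOKGW"><a>extra prefix</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# Collect the ids of the tags whose id matches the compiled pattern.
matched_ids = [tag['id'] for tag in soup.find_all(id=TRACK_ID)]
print(matched_ids)
```

Only the first div is kept: the empty suffix fails \w+, the hyphen is not a word character, and the extra prefix is rejected by ^. Compared with the startswith check, the regex is stricter about what may follow the prefix.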

How to scrape similar classes that differ in one attribute
