识别标签

时间:2016-08-12 01:39:47

标签: python beautifulsoup

我正试图走下<a>标签的兄弟结构,其间有<br>个标签。当我尝试获取br标记的elem.name时,出现错误。有没有办法跳过这些br标签?

目前,我在解析之前会html = html.replace('<br>','\n'),但这会导致bsoup用换行符插入^ M个字符。

    r = requests.get(url, headers=headers)
    # page = r.text.replace('<br>','\n')
    soup = bsoup(r.text, 'html.parser')
    soup = soup.find('div', id='listAlbum')
    albums = soup.find_all('div', class_='album')
    for album in albums:
            name = album.text.replace('"','').replace(':','').rstrip()
            print(name)
            albumtask(name)
            song = album.next_sibling
            while song.name != 'div' and song.name != 'script':
                    if song.name != 'a' or song.get('id'):
                            song = song.next_sibling
                            continue
                    t = threading.Thread(target=tsong, args=(song,))
                    t.start()
                    song = song.next_sibling
                    while song.is_empty_element:
                            song = song.next_sibling
                    time.sleep(0.2)

<div id="listAlbum">
<a id="1545"></a><div class="album">album: <b>"Pablo Honey"</b> (1993)<span>&nbsp;&nbsp;<a href="http://www.amazon.com/gp/search?ie=UTF8&amp;keywords=RADIOHEAD+Pablo+Honey&amp;tag=azlyricsunive-20&amp;index=music&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325" rel="external"><img width="30" height="18" src="http://images.azlyrics.com/amn.png" alt="buy this CD or download MP3s at amazon.com!"></a></span></div>
<a href="../lyrics/radiohead/you.html" target="_blank">You</a><br>
<a href="../lyrics/radiohead/creep.html" target="_blank">Creep</a><br>
<a href="../lyrics/radiohead/howdoyou.html" target="_blank">How Do You?</a><br>
<a href="../lyrics/radiohead/stopwhispering.html" target="_blank">Stop Whispering</a><br>
<a href="../lyrics/radiohead/thinkingaboutyou.html" target="_blank">Thinking About You</a><br>
<a href="../lyrics/radiohead/anyonecanplayguitar.html" target="_blank">Anyone Can Play Guitar</a><br>
<a href="../lyrics/radiohead/ripcord.html" target="_blank">Ripcord</a><br>
<a href="../lyrics/radiohead/vegetable.html" target="_blank">Vegetable</a><br>
<a href="../lyrics/radiohead/proveyourself.html" target="_blank">Prove Yourself</a><br>
<a href="../lyrics/radiohead/icant.html" target="_blank">I Can't</a><br>
<a href="../lyrics/radiohead/lurgee.html" target="_blank">Lurgee</a><br>
<a href="../lyrics/radiohead/blowout.html" target="_blank">Blow Out</a><br>

<a id="1543"></a><div class="album">EP: <b>"My Iron Lung"</b> (1994)<span>&nbsp;&nbsp;<a href="http://www.amazon.com/gp/search?ie=UTF8&amp;keywords=RADIOHEAD+My+Iron+Lung&amp;tag=azlyricsunive-20&amp;index=music&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325" rel="external"><img width="30" height="18" src="http://images.azlyrics.com/amn.png" alt="buy this CD or download MP3s at amazon.com!"></a></span></div>
<a href="../lyrics/radiohead/myironlung.html" target="_blank">My Iron Lung</a><br>

它继续这样。

1 个答案:

答案 0 :(得分:1)

我会首先遍历每个专辑 - 这些是与#listAlbum .album CSS选择器匹配的元素。现在,对于每张专辑find all a following siblings并迭代他们收集歌曲标题。遇到带有id的元素时,请中断。实现:

from collections import defaultdict
from pprint import pprint

from bs4 import BeautifulSoup


data = """
<div id="listAlbum">
    <a id="1545"></a><div class="album">album: <b>"Pablo Honey"</b> (1993)<span>&nbsp;&nbsp;<a href="http://www.amazon.com/gp/search?ie=UTF8&amp;keywords=RADIOHEAD+Pablo+Honey&amp;tag=azlyricsunive-20&amp;index=music&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325" rel="external"><img width="30" height="18" src="http://images.azlyrics.com/amn.png" alt="buy this CD or download MP3s at amazon.com!"></a></span></div>
    <a href="../lyrics/radiohead/you.html" target="_blank">You</a><br>
    <a href="../lyrics/radiohead/creep.html" target="_blank">Creep</a><br>
    <a href="../lyrics/radiohead/howdoyou.html" target="_blank">How Do You?</a><br>
    <a href="../lyrics/radiohead/stopwhispering.html" target="_blank">Stop Whispering</a><br>
    <a href="../lyrics/radiohead/thinkingaboutyou.html" target="_blank">Thinking About You</a><br>
    <a href="../lyrics/radiohead/anyonecanplayguitar.html" target="_blank">Anyone Can Play Guitar</a><br>
    <a href="../lyrics/radiohead/ripcord.html" target="_blank">Ripcord</a><br>
    <a href="../lyrics/radiohead/vegetable.html" target="_blank">Vegetable</a><br>
    <a href="../lyrics/radiohead/proveyourself.html" target="_blank">Prove Yourself</a><br>
    <a href="../lyrics/radiohead/icant.html" target="_blank">I Can't</a><br>
    <a href="../lyrics/radiohead/lurgee.html" target="_blank">Lurgee</a><br>
    <a href="../lyrics/radiohead/blowout.html" target="_blank">Blow Out</a><br>

    <a id="1543"></a><div class="album">EP: <b>"My Iron Lung"</b> (1994)<span>&nbsp;&nbsp;<a href="http://www.amazon.com/gp/search?ie=UTF8&amp;keywords=RADIOHEAD+My+Iron+Lung&amp;tag=azlyricsunive-20&amp;index=music&amp;linkCode=ur2&amp;camp=1789&amp;creative=9325" rel="external"><img width="30" height="18" src="http://images.azlyrics.com/amn.png" alt="buy this CD or download MP3s at amazon.com!"></a></span></div>
    <a href="../lyrics/radiohead/myironlung.html" target="_blank">My Iron Lung</a><br>
</div>"""

soup = BeautifulSoup(data, "html5lib")
albums = defaultdict(list)
for album in soup.select("#listAlbum .album"):
    album_title = album.get_text().strip()
    for song in album.find_next_siblings("a"):
        if "id" in song.attrs:
            break

        song_title = song.get_text(strip=True)
        albums[album_title].append(song_title)

pprint(dict(albums))

打印:

{'EP: "My Iron Lung" (1994)': ['My Iron Lung'],
 'album: "Pablo Honey" (1993)': ['You',
                                 'Creep',
                                 'How Do You?',
                                 'Stop Whispering',
                                 'Thinking About You',
                                 'Anyone Can Play Guitar',
                                 'Ripcord',
                                 'Vegetable',
                                 'Prove Yourself',
                                 "I Can't",
                                 'Lurgee',
                                 'Blow Out']}