How to use BeautifulSoup to get the content between <hr class="calibre2"/> ... <hr class="calibre2"/>

Time: 2016-07-31 03:57:22

Tags: python beautifulsoup html-parsing

<hr class="calibre2" />
<h3 class="calibre5">-ability</h3> (in nouns 构成名词) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ capability 能力 </span></p></blockquote>

<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ responsibility 责任 </span></p></blockquote>

<hr class="calibre2" />
<h3 class="calibre5">-ibility</h3> (in nouns 构成名词) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ capability 能力 </span></p></blockquote>

<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ responsibility 责任 </span></p></blockquote>

<hr class="calibre2" />

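The above is part of my soup. I want to get the content between two <hr> tags. Because hr is not a paired tag (it has no closing tag and does not wrap the content), I can't use the simple approaches. I think I could use find_next_elements, but how can I make it stop when it reaches <hr class='calibre2'>, so that I get only that content? Thanks.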

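2 answers:

Answer 0 (score: 2)

You can loop over all the hr elements and use .find_next_siblings() to iterate over the siblings that follow each one. If you encounter another hr, break out of the loop: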

for hr in soup.find_all("hr", class_="calibre2"):
    for item in hr.find_next_siblings():
        if item.name == "hr":
            break

        print(item)
    print("-----")
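One thing to keep in mind with the loop above: .find_next_siblings() returns tags only, so bare text that sits between tags (such as "(in nouns ...) :" in the question's markup) is skipped. Below is a minimal sketch that walks .next_siblings instead, which also yields text nodes. The helper name sections_between_hr, the shortened sample HTML, and the choice of html.parser are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag

# Shortened sample, modelled on the markup in the question (illustrative only).
html = """
<hr class="calibre2" />
<h3 class="calibre5">-ability</h3> (in nouns) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">capability</span></p></blockquote>
<hr class="calibre2" />
<h3 class="calibre5">-ibility</h3> (in nouns) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">responsibility</span></p></blockquote>
<hr class="calibre2" />
"""


def sections_between_hr(soup, css_class="calibre2"):
    """Hypothetical helper: return the text found after each matching <hr>,
    up to (but not including) the next <hr>."""
    sections = []
    for hr in soup.find_all("hr", class_=css_class):
        nodes = []
        # .next_siblings yields both tags and text nodes, in document order.
        for node in hr.next_siblings:
            if isinstance(node, Tag) and node.name == "hr":
                break
            nodes.append(node)
        pieces = [n.get_text() if isinstance(n, Tag) else str(n) for n in nodes]
        # Collapse runs of whitespace so each section is a single clean line.
        text = " ".join(" ".join(pieces).split())
        if text:  # the last <hr> is followed only by whitespace, so skip it
            sections.append(text)
    return sections


soup = BeautifulSoup(html, "html.parser")
sections = sections_between_hr(soup)
for section in sections:
    print(section)
```

If you need the elements themselves rather than their text, collect the nodes list per hr instead of joining it into a string.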

Answer 1 (score: 0)

You can use find_all_next and check each element for an hr tag with the calibre2 class: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-next-and-find-next

from bs4 import BeautifulSoup

testStr = """
<hr class="calibre2" />
<h3 class="calibre5">-ability</h3> (in nouns 构成名词) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ capability 能力 </span></p></blockquote>

<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ responsibility 责任 </span></p></blockquote>

<hr class="calibre2" />
<h3 class="calibre5">-ibility</h3> (in nouns 构成名词) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ capability 能力 </span></p></blockquote>

<blockquote class="calibre6"><p class="calibre_1"><span class="italic">◊ responsibility 责任 </span></p></blockquote>

<hr class="calibre2" />
""";
soup = BeautifulSoup(testStr, 'lxml')
hrTag = soup.hr

nextTags = hrTag.find_all_next()

content = []

for item in nextTags:
    # .get() avoids a KeyError for tags that have no class attribute
    classes = item.get('class', [])
    print("Name %s ; Class %s" % (item.name, classes))
    # stop once we have reached the second calibre2 hr
    if item.name == 'hr' and 'calibre2' in classes:
        break
    content.append(item)
print(content)
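Note that find_all_next() walks the whole document in order, so nested tags are returned as well: each blockquote is followed by its own p and span, and the same content ends up in the list more than once. A minimal sketch of one way to avoid that is to keep only the tags that share the hr's parent. The shortened sample HTML, the variable names, and the use of html.parser here are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample, modelled on the markup in the question (illustrative only).
html = """
<hr class="calibre2" />
<h3 class="calibre5">-ability</h3> (in nouns) : <br class="calibre4" />
<blockquote class="calibre6"><p class="calibre_1"><span class="italic">capability</span></p></blockquote>
<hr class="calibre2" />
"""

soup = BeautifulSoup(html, "html.parser")
first_hr = soup.find("hr", class_="calibre2")

content = []
for tag in first_hr.find_all_next():
    # Stop at the next <hr class="calibre2">.
    if tag.name == "hr" and "calibre2" in tag.get("class", []):
        break
    # Keep only tags at the same nesting level as the <hr>;
    # their descendants (p, span) are already contained in them.
    if tag.parent is first_hr.parent:
        content.append(tag)

names = [tag.name for tag in content]
print(names)
```

Like the code above, this collects tags only; text nodes between the tags are not included.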