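BeautifulSoup: find all tags before a stop condition is met

Time: 2017-12-09 00:57:19

Tags: python html beautifulsoup

I am trying to extract tags of a given class from an HTML file, but only those that appear before a given stopping point. What I have is: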

import requests
from bs4 import BeautifulSoup

page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

This works, but it finds every instance of myclass. I only want the ones that appear before the following text in the soup:

<h4 class="cat-title" id="55">
 Title text N1
 <small>
  Title text N2.
 </small>
</h4>

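What makes this block unique is the Title text N lines, especially the Title text N2. line. Many other cat-title tags come before it, so I cannot use the class alone as the stop condition.

The code surrounding this block looks like this: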

...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...

and so on, above and below.

How can I do this?

3 answers:

Answer 0: (score: 1)

You could try something like this:

from bs4 import BeautifulSoup

page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
 Title text N1
 <small>
  Title text N2.
 </small>
</h4>

<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')

# Walk every tag in document order and stop at the marker heading
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)

Output:

<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
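A narrower variant of the loop above, as a sketch only: it visits just the h4 and span tags and stops on the text of the small element rather than on the id. The helper name spans_before and the sample HTML (including the extra id="54" heading) are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: an earlier cat-title heading must NOT stop the scan.
html = """
<div><span class="myclass">text 1</span>
<span class="myclass">text 2</span></div>
<h4 class="cat-title" id="54">Other title <small>Other.</small></h4>
<div><span class="myclass">text 3</span></div>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div><span class="myclass">text 4</span></div>
"""

def spans_before(soup, stop_text):
    """Collect span.myclass tags in document order until the stop heading."""
    found = []
    # Only visit the two tag names we care about, in document order.
    for tag in soup.find_all(["h4", "span"]):
        if tag.name == "h4":
            small = tag.find("small")
            if small is not None and small.get_text(strip=True) == stop_text:
                break
        elif "myclass" in tag.get("class", []):
            found.append(tag)
    return found

soup = BeautifulSoup(html, "html.parser")
result = [s.get_text() for s in spans_before(soup, "Title text N2.")]
print(result)
```

If the stop heading is never found, this version returns every matching span, which may or may not be what you want.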

Answer 1: (score: 1)

Try using find_all_previous:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55')  # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")

If there are several, this stops at the first <h4 class='cat-title', id=55> tag.

Reference: Beautiful Soup Documentation
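One detail worth knowing about find_all_previous: it walks backwards from the stop tag, so the results come back closest-first. The sketch below assumes an inline sample page instead of a real request, and it locates the stop heading by the text of its small child, since the question says that text is what makes the block unique.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for the real site.
html = """
<div><span class="myclass">text 1</span>
<span class="myclass">text 2</span></div>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div><span class="myclass">text 3</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the heading whose <small> text matches the stop text.
stop_at = None
for h4 in soup.find_all("h4", class_="cat-title"):
    small = h4.find("small")
    if small is not None and small.get_text(strip=True) == "Title text N2.":
        stop_at = h4
        break

if stop_at is None:
    spans = []  # no stop heading found; decide how you want to handle this
else:
    # find_all_previous yields the nearest match first, so reverse it
    # to get the spans back in document order.
    spans = list(reversed(stop_at.find_all_previous("span", class_="myclass")))

texts = [s.get_text() for s in spans]
print(texts)
```

Checking for None matters here: soup.find and the loop above both come up empty when the heading is missing, and calling find_all_previous on None raises an AttributeError.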

Answer 2: (score: 1)

How about this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://mysite")
# Split the page at the unwanted string, then parse only the first part with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
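A self-contained sketch of the same string-splitting idea, using an inline sample page (made up for illustration) and str.partition so that a missing marker can be detected. With split, a page that does not contain the marker would be parsed whole and every span would be returned silently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for page.text.
html = """
<div><span class="myclass">text 1</span>
<span class="myclass">text 2</span></div>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div><span class="myclass">text 3</span></div>
"""
marker = "Title text N2."

# partition returns (before, marker, after); the middle part is empty
# when the marker does not occur in the text.
before, found, _after = html.partition(marker)
if not found:
    raise ValueError("stop marker not found in page")

# html.parser tolerates the unclosed tags left by cutting the page short.
soup = BeautifulSoup(before, "html.parser")
texts = [s.get_text() for s in soup.find_all("span", class_="myclass")]
print(texts)
```

This matches against the raw page source, so it only works if the marker appears there exactly as written, with the same whitespace and no HTML entities in between.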