Question

I want to scrape some data from a website that uses `<br>`. In the HTML parsed with beautifulsoup4, I sometimes have the following pattern:

"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span> 
<br> 
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'> 
</span>text_5.</p>"

However, if the website were written in a better way, it would look like this:

"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
</p> <p class=some_class>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>"

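To extract the strings I want, I would extract all the text in each `<p>`. However, the strings I now want to separate are delimited by `<br>`.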

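My question is the following: how can I use `<br>` to split apart the parts of the string I am interested in? That is, I would like something like [text_1.+text_2+text_3, text_4+text_5.].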

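I am asking explicitly about using `<br>` because it is the only element I have found that separates the strings I am interested in. Moreover, in other parts of the website, the strings I am interested in are separated by `<p>` rather than `<br>`.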

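Since my object is a bs4 Tag, I cannot solve this with the replace() function. Also, using find("br") from bs4 gives me "<br/>" rather than the text I want, so the answer in the related question is not what I am looking for. I think one way would be to convert the bs4 Tag to an HTML string, use replace() to change the "<br/>", and finally convert the result back to a bs4 element. However, I do not know how to make this conversion, and I would also like to know whether there is a simpler, shorter way to do it.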

Answer 1

This is the solution I found. It is long and inefficient because it does not really take advantage of bs4's features, but it works.

from bs4 import BeautifulSoup

html_doc = """
"<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span> 
<br> 
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'> 
</span>text_5.</p>"
"""

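def replace_br(soup_object):
    # Serialize the Tag (or plain string) back to HTML text.
    html1 = str(soup_object)
    # A bs4 Tag serializes <br> as <br/>, so handle both spellings.
    html1 = html1.replace("<br/>", "</p> <p>").replace("<br>", "</p> <p>")
    # Re-parse so that each piece becomes its own <p> element.
    soup_html1 = BeautifulSoup(html1, 'html.parser')
    return soup_html1.find_all("p")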

replace_br(html_doc)
[<p class="some_class">text_1. text_2 (text_3<span class="GramE">)</span>
</p>, <p> 
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>]
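
For comparison, here is a minimal sketch of a way to do the split without going through str() and re-parsing. It is not the approach from the answer above: it swaps each `<br>` for a marker string with replace_with() and then splits the extracted text on that marker. The helper name split_on_br and the marker value are made up for illustration, and the sketch assumes the marker never appears in the page text. Note that it modifies the Tag it is given.

```python
from bs4 import BeautifulSoup

html_doc = """<p class=some_class>text_1. text_2 (text_3<span class=GramE>)</span>
<br>
text_4,<span style='mso-fareast-font-family:"Arial Unicode MS"'>
</span>text_5.</p>"""

# Hypothetical marker; assumed not to occur anywhere in the real page text.
SEPARATOR = "|||BR|||"

def split_on_br(tag):
    """Return the text of `tag` as a list of pieces, split at each <br>."""
    # Swap every <br> for the marker string (this changes `tag` in place).
    for br in tag.find_all("br"):
        br.replace_with(SEPARATOR)
    # Collect all the text, cut it at the markers, and collapse whitespace.
    pieces = tag.get_text().split(SEPARATOR)
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]

soup = BeautifulSoup(html_doc, "html.parser")
parts = split_on_br(soup.find("p"))
print(parts)  # ['text_1. text_2 (text_3)', 'text_4, text_5.']
```

Because the result is a list of plain strings, the same helper should also work on a `<p>` that contains no `<br>` at all: it would simply return a one-element list.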

Using beautifulsoup to separate strings delimited by `<br/>`
