Question

I have online text data like this:

    plain_text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&amp;sa=U&amp;ved=0ahUKEwCgAMAA&amp;usg=AOvVaw1pasRFOwk">
        </b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""

I am trying to extract the text that follows each <br> tag in this text, so the output would look like:

commercial moving services nationwide. Visit our website today to learn more

5200 Interchange Way Louisville, KY 40229.

This does not work for me:

 soup=BeautifulSoup(plain_text,"lxml")
 out=soup.find_all('br')

It only gives me the empty tags:

[<br/>,
 <br/>]

Answer 1

You can use next_sibling; see the code below.

from bs4 import BeautifulSoup
text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&amp;sa=U&amp;ved=0ahUKEwCgAMAA&amp;usg=AOvVaw1pasRFOwk">
        </b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""

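soup = BeautifulSoup(text, 'lxml')
# The text node right after the first <br> is the description line.
name = soup.br.next_sibling
# Step forward past the second <br> to the <div>; its text holds the address.
address = name.next_element.next_element.text.strip()
print(name, '\n', address)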

Output

 commercial moving services nationwide. Visit our website today to learn more!
 5200 Interchange Way Louisville, KY 40229.
         ...  A. Arnold
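
The same idea can be generalized so that it does not depend on how many steps separate the two lines. The following is only a minimal sketch, not part of the original answer: it loops over every <br> and keeps the text node that immediately follows it. The helper name text_after_br is hypothetical, the sample markup is trimmed from the question, and the built-in "html.parser" is assumed here instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup trimmed from the question (query string shortened).
SAMPLE = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/">
         Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""


def text_after_br(html):
    """Return the stripped text node that directly follows each <br>.

    Hypothetical helper for illustration: a <br> followed by another tag
    (or by whitespace only) contributes nothing to the result.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for br in soup.find_all("br"):
        sibling = br.next_sibling
        # Keep only plain text nodes; skip tags such as the <div>.
        if isinstance(sibling, NavigableString):
            cleaned = sibling.strip()
            if cleaned:
                results.append(cleaned)
    return results


if __name__ == "__main__":
    for line in text_after_br(SAMPLE):
        print(line)
```

With this sample, the first two entries are the two lines asked for in the question; the stray "..." after the last <br> is also returned and would need to be filtered out separately if it is not wanted.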

Extracting <br/> records from parsed HTML data using BeautifulSoup
