Extracting <br/> records from HTML-parsed data using BeautifulSoup

Time: 2018-09-24 09:21:06

Tags: beautifulsoup tags html-parsing

I have text data scraped from the web that looks like this:

    plain_text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&amp;sa=U&amp;ved=0ahUKEwCgAMAA&amp;usg=AOvVaw1pasRFOwk">
        </b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""

I am trying to extract the text that follows each <br> tag in this data, so the output would look something like:

commercial moving services nationwide. Visit our website today to learn more

5200 Interchange Way Louisville, KY 40229.

This does not work for me:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(plain_text, "lxml")
    out = soup.find_all('br')

It gives me only the empty tags:

[<br/>,
 <br/>]

1 answer:

Answer 0 (score: 0)

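You can use next_sibling: a <br> tag has no content of its own, so the text you want is the node that follows it. See the code below.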

from bs4 import BeautifulSoup
text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&amp;sa=U&amp;ved=0ahUKEwCgAMAA&amp;usg=AOvVaw1pasRFOwk">
        </b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""

soup = BeautifulSoup(text, 'lxml')
# The text right after the first <br> is that tag's next sibling (a text node).
name = soup.br.next_sibling
# Step forward two elements from that text node (for this input: the second
# <br>, then the <div class="osl">) and take all of the text inside it.
address = name.next_element.next_element.text.strip()
print(name, '\n', address)

Output

 commercial moving services nationwide. Visit our website today to learn more!
 5200 Interchange Way Louisville, KY 40229.
         ...  A. Arnold
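
To apply the same next_sibling idea to every <br> instead of only the first one, something along these lines might work. This is only a minimal sketch: the helper name text_after_breaks is hypothetical, and it uses the built-in "html.parser" backend rather than lxml (an assumption made so the snippet needs no extra dependency; other parsers may build a slightly different tree for malformed HTML like this).

```python
from bs4 import BeautifulSoup, NavigableString

# Sample input copied from the question.
text = """<a href="/url?q=https://www.aarnoldmovingcompany.com/contact-us/&amp;sa=U&amp;ved=0ahUKEwCgAMAA&amp;usg=AOvVaw1pasRFOwk">
        </b> Moving Louisville - Headquarters.<br>
commercial moving services nationwide. Visit our website today to learn more!<br><div class="osl">
<br>
         5200 Interchange Way Louisville, KY 40229.<br>
         ... <b> A. Arnold</b>"""


def text_after_breaks(html):
    """Hypothetical helper: return the text node that directly follows each <br>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for br in soup.find_all("br"):
        sibling = br.next_sibling
        # Keep the sibling only if it is a text node; a <br> followed
        # directly by another tag (such as the <div>) is skipped.
        if isinstance(sibling, NavigableString):
            cleaned = sibling.strip()
            if cleaned:
                results.append(cleaned)
    return results


for line in text_after_breaks(text):
    print(line)
```

Note that for this input the list also contains the trailing "..." fragment, because it directly follows the last <br>; you would need to filter that out yourself if you only want the description and the address.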