Extracting text from an HTML paragraph using BeautifulSoup in Python

Time: 2014-12-24 05:28:12

Tags: python html web-scraping beautifulsoup

<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>

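This is a paragraph that I want to scrape from an HTML page using BeautifulSoup in Python. I can get the values inside the tags using .children and the .string attribute. But I cannot get the text "Several new Point of Sale malware fa...", which sits directly inside the paragraph without any tag of its own. I tried soup.p.text, .get_text(), and so on, but with no luck.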

1 answer:

Answer 0: (score: 1)

Use find_all() with text=True to find all text nodes, together with recursive=False to search only among the direct children of the parent p tag:

from bs4 import BeautifulSoup

data = """
<p>
    <a name="533660373"></a>
    <strong>Title: Point of Sale Threats Proliferate</strong><br />
    <strong>Severity: Normal Severity</strong><br />
    <strong>Published: Thursday, December 04, 2014 20:27</strong><br />
    Several new Point of Sale malware families have emerged recently, to include LusyPOS,..<br />
    <em>Analysis: Emboldened by past success and media attention, threat actors  ..</em>
    <br />
</p>
"""

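# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
# text=True matches only text nodes (newer bs4 releases also accept string=True);
# recursive=False restricts the search to the direct children of <p>.
print(''.join(text.strip() for text in soup.p.find_all(text=True, recursive=False)))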

Prints:

Several new Point of Sale malware families have emerged recently, to include LusyPOS,..
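
Since the question mentions .children, the same result can probably be reached by walking the paragraph's direct children and keeping only the bare text nodes. The following is a minimal sketch, not part of the original answer: the helper name direct_text, its separator parameter, and the shortened sample HTML are all made up for illustration. It also joins the pieces with a space, which matters once a paragraph contains more than one untagged run of text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def direct_text(tag, separator=" "):
    """Return text that sits directly inside `tag`, skipping text in child tags.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    pieces = []
    for child in tag.children:
        # Child tags (<strong>, <em>, <br/>) are Tag objects; bare text is a
        # NavigableString. HTML comments are also strings, so exclude them.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:  # drop whitespace-only nodes between tags
                pieces.append(stripped)
    return separator.join(pieces)


# Shortened, made-up sample with two untagged text runs.
html = """
<p>
    <strong>Title: Example</strong><br />
    First bare sentence.<br />
    <em>Analysis: nested text</em>
    Second bare sentence.
</p>
"""

soup = BeautifulSoup(html, "html.parser")
result = direct_text(soup.p)
print(result)
```

With the sample above, the text inside strong and em is skipped and only the two untagged sentences are returned.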