Extracting text from a <p> element

Time: 2018-03-01 18:45:15

Tags: python html web-scraping beautifulsoup

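I'm writing a script with BeautifulSoup to extract the text from <p> elements. It works well until I hit a <p> element that contains <br> tags, in which case it only captures the text before the first <br> tag. How can I edit my code to capture all of the text?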

My code:

coms = soup.select('li > div[class=comments]')[0].select('p')
inp = [i.find(text=True).lstrip().rstrip() for i in coms]

The problem HTML (note the <br> tags):

<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>

What my code currently outputs:

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.'

What my code should output (note the extra text):

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen. ITR info: Rachel Hoffman, CD Chris Kory, acc. Monitor is Iftiaz Haroon.'

(Note: forgive my sometimes shaky terminology; I'm largely self-taught.)

2 answers:

Answer 0: (score: 0)

I'm afraid the premise of the question may be mistaken. I copied the HTML into a file and ran the following code:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('matthew.htm').read(), 'lxml')
>>> soup.find('p').text
'             \n                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.\n\nITR info:\n\nRachel Hoffman, CD\nChris Kory, acc.\n\nMonitor is Iftiaz Haroon.                '

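As you can see, .text returns all of the text in the <p> element, including the parts after each <br>. From there, recovering the desired string is just a matter of cleaning up the whitespace.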
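The whitespace cleanup can be sketched as follows. This is a minimal example, not the answerer's own code: it assumes a shortened version of the sample HTML held in a string, and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies. The names raw and cleaned are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample HTML from the question (assumption: the
# real page has the same shape, text fragments separated by <br> tags).
html = """<p>
        Alts called now through 53.<br>
<br>
ITR info:<br>
Monitor is Iftiaz Haroon.        </p>"""

soup = BeautifulSoup(html, "html.parser")

# .text concatenates every text node under the <p>, newlines included.
raw = soup.find("p").text

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with a single space collapses newlines and padding.
cleaned = " ".join(raw.split())
print(cleaned)
```

Note that this also collapses any repeated spaces inside a fragment, which may or may not be what you want.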

Answer 1: (score: 0)

You can use get_text(strip=True).

From the documentation:

  

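If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.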

     

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text, using strip=True.

from bs4 import BeautifulSoup

html = '''<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find('p').get_text(strip=True))

Output:

Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.ITR info:Rachel Hoffman, CDChris Kory, acc.Monitor is Iftiaz Haroon.
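The output above runs the fragments together ("seen.ITR info:") because get_text() joins the text nodes with an empty string by default. A possible refinement, offered as a sketch and not as part of the original answer, is to pass a separator as the first argument of get_text(). The example below assumes the same sample HTML and uses the built-in 'html.parser' so that it has no dependency on lxml; the variable name text is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<p>
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>'''

soup = BeautifulSoup(html, 'html.parser')

# The first argument is the separator placed between text fragments;
# strip=True trims each fragment and drops the ones that end up empty.
text = soup.find('p').get_text(' ', strip=True)
print(text)
```

With a single space as the separator, the result should match the string the question asks for.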