How do I convert <br/> and <p> into line breaks?

Asked: 2012-05-08 01:10:56

Tags: python html xml regex

Suppose my HTML contains <p> and <br> tags. Afterwards, I strip the HTML to clean out the tags. How can I turn those tags into line breaks?

I'm using Python's BeautifulSoup library, if that helps.

4 Answers:

Answer 0: (score: 15)

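Without more details it's hard to be sure this does exactly what you want, but it should give you the idea. It assumes your br tags are contained in p elements.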

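# Python 2 with BeautifulSoup 3 (the pre-bs4 package)
from BeautifulSoup import BeautifulSoup
import types

def replace_with_newlines(element):
    """Return the text inside element, with each <br> turned into a newline."""
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, types.StringTypes):
            # Text node: strip surrounding whitespace, including source newlines
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text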

page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""

soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt

Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$
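
If the br tags are not guaranteed to sit inside p elements, or BeautifulSoup is not available, roughly the same idea can be sketched with the standard library's html.parser on Python 3. This is only a minimal sketch, not the code above: the names NewlineTextExtractor and html_to_lines are made up for illustration, and it assumes a newline is wanted for every <br> and after every closing </p>.

```python
import re
from html.parser import HTMLParser


class NewlineTextExtractor(HTMLParser):
    """Hypothetical helper: collect text, adding a newline for <br> and after </p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br/> also lands here, because handle_startendtag calls handle_starttag
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse runs of whitespace (including source newlines) to one space
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_lines(html):
    """Hypothetical helper: return the text of html with tags turned into newlines."""
    parser = NewlineTextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return "\n".join(lines).strip()


print(html_to_lines("<p>America,<br>\nNow is the<br>time</p>\n<p>debt<br></p>"))
```

Unlike the strip()-per-text-node approach, this keeps the spaces around inline tags such as <b>, at the cost of a cleanup pass at the end.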

Answer 1: (score: 3)

get_text seems to do what you need:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
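
One caveat: separator="\n" is inserted between every pair of text nodes, so inline tags such as <b> also produce line breaks. A possible workaround, sketched here under the assumption that bs4 is installed and the built-in "html.parser" backend is acceptable, is to replace each <br> with a newline string first and then join the paragraph texts. The function name html_to_text is hypothetical.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Hypothetical helper: <br> becomes a newline, each <p> becomes one block."""
    soup = BeautifulSoup(html, "html.parser")
    # Swap every <br> tag for a literal newline text node
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # get_text() without a separator keeps inline tags on the same line
    paragraphs = [p.get_text() for p in soup.find_all("p")]
    return "\n".join(paragraphs)


doc = "<p>Now is the<br>time</p><p>for <b>all</b> good men</p>"
print(html_to_text(doc))
# Now is the
# time
# for all good men
```

Nested p elements or text outside any p would need extra handling; this sketch ignores both.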

Answer 2: (score: 1)

This is a Python 3 version of @Mike Pennington's answer (which really helped), with a little refactoring.

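def replace_with_newlines(element):
    """Return the text inside element, with each <br> turned into a newline."""
    text = ''
    for elem in element.recursiveChildGenerator():
        # bs4 text nodes (NavigableString) are subclasses of str
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text


def get_plain_text(soup):
    """Concatenate the converted text of every <p> inside <body>."""
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        # Note: paragraphs are joined with no separator between them
        plain_text += line
    return plain_text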

To use it, just pass the BeautifulSoup object to the get_plain_text function.

from bs4 import BeautifulSoup

# page is an HTML string, e.g. the one defined in Answer 0
soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)

Answer 3: (score: -5)

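I'm not entirely sure what you're trying to accomplish, but if you just want to remove the HTML elements, I would use a program like Notepad2 and its Replace All feature. I think you can also insert new lines with Replace All. Make sure the replacement also removes the closing tag of the <p> element (</p>). Also, just as an FYI, proper HTML5 is <br /> rather than <br>, but that doesn't really matter. Python wouldn't be my first choice, so this is a bit outside my area of knowledge; sorry I can't help more.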
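
The same "Replace All" idea can be done in Python with the standard re module. This is a rough sketch rather than a robust solution: the function name strip_tags_with_newlines is made up, and regex-based stripping is fragile on real-world HTML (comments, scripts, attribute values containing ">"), so a parser is usually the safer choice.

```python
import re
from html import unescape


def strip_tags_with_newlines(html):
    """Hypothetical helper: regex-based equivalent of an editor's Replace All."""
    # <br>, <br/> and <br /> all become a newline
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Each closing </p> becomes a newline
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Remove every remaining tag
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim leading/trailing whitespace
    return unescape(text).strip()


print(strip_tags_with_newlines("<p>one<br>two</p><p>three &amp; four</p>"))
# one
# two
# three & four
```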