Question

Sorry, another Python newbie question. I have a string:

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"

I want:

my_string = "<p>this is some fun</p>And this is \n some more fun!"

In other words, how do I get rid of the '\n' when it occurs inside an HTML tag?

I have:

my_string = re.sub('<(.*?)>(.*?)\n(.*?)</(.*?)>', 'replace with what???', my_string)

This obviously doesn't work, but I'm stuck.

Answer 1

Regular expressions are a poor fit for HTML. Don't do it. See "RegEx match open tags except XHTML self-contained tags".
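To make the problem concrete, here is a small sketch of a regex-based attempt. The pattern and the helper name naive_strip are my own illustration, not something from the question. It happens to work on the flat example string, but as soon as tags nest, the lazy match stops at the first closing tag and the rest of the outer element is never visited.

```python
import re

# Hypothetical pattern: an opening tag, lazy content, and the same tag closing.
TAG_PAIR = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def naive_strip(html):
    """Remove '\n ' inside each regex match (illustration only)."""
    return TAG_PAIR.sub(lambda m: m.group(0).replace("\n ", ""), html)

flat = "<p>this is some \n fun</p>And this is \n some more fun!"
nested = "<div><div>a \n b</div>c \n d</div>"

print(repr(naive_strip(flat)))
# '<p>this is some fun</p>And this is \n some more fun!'
print(repr(naive_strip(nested)))
# '<div><div>a b</div>c \n d</div>'  (the text after the inner div is missed)
```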

Instead, use an HTML parser. Python ships with html.parser, and you can also use Beautiful Soup or html5lib. All you need to do is walk the tree and remove the newlines.
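As a rough sketch of that idea using only the standard library: html.parser is event-based rather than tree-based, so the version below tracks nesting depth and rebuilds the markup as it goes. The class name NewlineStripper, the helper strip_newlines_in_tags, and the VOID set are my own assumptions for illustration. It also assumes reasonably well-nested input, and it silently drops comments and doctype declarations because those handlers are not overridden.

```python
from html.parser import HTMLParser

# Assumed (incomplete) list of elements that never get a closing tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class NewlineStripper(HTMLParser):
    """Re-emit the markup, removing '\n ' from text nested inside any tag."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0   # number of currently open tags
        self.out = []    # pieces of the rebuilt document

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> do not change the depth.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.depth = max(0, self.depth - 1)
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth > 0:  # only touch text that sits inside some tag
            data = data.replace("\n ", "")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

def strip_newlines_in_tags(html):
    parser = NewlineStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
print(repr(strip_newlines_in_tags(my_string)))
# '<p>this is some fun</p>And this is \n some more fun!'
```

Text outside any tag is passed through untouched, which is what the question asks for. A tree-building parser such as Beautiful Soup is probably more comfortable for anything beyond this.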

Answer 2

You should try BeautifulSoup (bs4), which lets you parse HTML/XML markup and whole pages.

>>> import bs4
>>> my_string = "<p>this is some \n fun</p>And this is \n some more fun!"
>>> soup = bs4.BeautifulSoup(my_string, "html.parser")  # name the parser explicitly
>>> p = soup.p.contents[0].replace('\n ', '')
>>> print(p)
this is some fun

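This pulls the newline out of the p tag. Note that replace() returns a new string and leaves the soup itself unchanged. If the content contains several tags, you can pass True to find_all to get every tag, loop over them, and rewrite each tag's direct text children in place. (There is no tag.child attribute in bs4; use tag.children or tag.contents.) This relies on the soup having been built with "html.parser"; lxml and html5lib wrap the whole input in <html><body>, so the trailing text would sit inside a tag as well.

For example:

>>> from bs4 import NavigableString
>>> for tag in soup.find_all(True):
...     for child in list(tag.children):  # copy, since the tree changes while looping
...         if isinstance(child, NavigableString):
...             # replace_with() edits the tree; str.replace() alone would not
...             child.replace_with(child.replace('\n ', ''))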

That said, this may not behave exactly the way you want (web pages vary), but you can adapt this code to your needs.
