Question

I am writing a script with feedparser (to extract RSS feeds). After calling a few functions I end up with a string named description that looks like this:

"This is the description of the feed. < img alt='' height='1' src='http://linkOfARandomImage.of/the/feed' width='1' />"

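The HTML tags can vary: I can get img, a href, p, h1, ... and the amount of them can vary too, so they are fairly random. What I want is to keep only the first text. I was wondering whether there is a way to remove all the tags. I was thinking of something like: from the character "&lt;" to the end, delete everything. But a "&lt;" could also appear in the middle of the description. I hope you get what I am trying to do. Thanks.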

Answer 1

To remove all the tags:

import re
text = "This is the description of <img alt='' height='1' src='http://linkOfARandomImage.of/the/feed' width='1' /> the <br> text"
text = re.sub("<.*?>", "", text)
#text = "This is the description of  the  text"

To remove the unnecessary whitespace:

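text = re.sub(r"\s+", " ", text)

Edit: the pattern has to be "\s+" (whitespace), not "\w+" (word characters, which would blank out the words themselves). Use "+" rather than "*", because "\s*" also matches the empty string and would insert a space between every pair of characters.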
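The two steps above can be combined into one helper. This is only a minimal sketch: the name strip_tags and the two compiled patterns are made up for illustration, and it assumes the tags are simple enough that no attribute value contains a ">". Tags are replaced by a space instead of an empty string so that words on either side of a tag do not get glued together.

```python
import re

# Hypothetical helper names; neither pattern is a real HTML parser.
TAG_RE = re.compile(r"<[^>]*>")   # anything from "<" up to the next ">"
SPACE_RE = re.compile(r"\s+")     # one or more whitespace characters

def strip_tags(text):
    """Replace tag-like spans with a space, then collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", without_tags).strip()

description = ("This is the description of <img alt='' height='1' "
               "src='http://linkOfARandomImage.of/the/feed' width='1' /> "
               "the <br> text")
print(strip_tags(description))
# This is the description of the text
```

For markup that is more irregular than this, a real parser (for example html.parser from the standard library) is likely the safer choice.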

Answer 2

If you only want to keep the first text (everything before any tag appears), there is no need for a regular expression.

Just use split and strip.

>>> html = "Some text here <tag>blabla</tag> <other>hey you</other>"
>>> text = html.split("<")[0].strip()
>>> text
"Some text here"

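split cuts the html string at every occurrence of the given character, so element [0] is everything before the first "<".

strip removes all whitespace from the beginning and the end of the resulting string.

Warning: this only works if the text you want to keep does not contain a "<" itself.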
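If a literal "<" can show up inside the text, one possible workaround is to cut at the first "<" that actually looks like the start of a tag. The sketch below is a heuristic, not a parser: first_text and TAG_START_RE are hypothetical names, and it assumes a tag always starts with "<", an optional "/", and then a letter with no space in between.

```python
import re

# Assumed heuristic: "<" followed by an optional "/" and a letter starts a tag.
TAG_START_RE = re.compile(r"</?[A-Za-z]")

def first_text(html):
    """Return the text that comes before the first tag-like "<"."""
    match = TAG_START_RE.search(html)
    end = match.start() if match else len(html)
    return html[:end].strip()

print(first_text("Prices < 10 are listed here <p>more</p> <b>bold</b>"))
# Prices < 10 are listed here
print(first_text("Some text here <tag>blabla</tag> <other>hey you</other>"))
# Some text here
```

A "<" written directly against a letter (as in "a <b") would still be treated as a tag, so this only narrows the limitation described in the warning.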

Python: remove HTML tags from a string

2 answers: