Question

How do I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Answer 1

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

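from bs4 import BeautifulSoup  # Beautiful Soup 4; the original answer used the BeautifulSoup 3 import

html = "<a> Keep me </a>"
soup = BeautifulSoup(html, "html.parser")

# Collect every text node and join them, discarding the tags.
text_parts = soup.find_all(string=True)
text = ''.join(text_parts)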

Answer 2

There is also a small library called stripogram that can be used to strip some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide.
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you simply want to strip all HTML, pass valid_tags=() to the first function.

You can find the documentation here.

Answer 3

You can remove all tags with a regular expression:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'

Answer 4

The regex, BeautifulSoup, and html2text solutions don't work if an attribute contains '>'. See Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?

A solution based on an HTML/XML parser may help in such cases, e.g., stripogram, suggested by @MrTopf, does work.

Here is an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
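
To make the attribute problem concrete, here is a minimal sketch comparing the tag-stripping regex from the other answers with the standard-library ElementTree parser. The input string, with a title attribute containing '>', is a made-up example and not part of the original question.

```python
import re
from xml.etree import ElementTree as etree  # stdlib

# Hypothetical input: the title attribute contains a literal '>'.
str_ = 'blah blah <a href="blah" title=">">link</a> END'

# The regex treats the '>' inside the attribute as the end of the tag,
# so part of the attribute leaks into the output.
regex_result = re.sub(r'<[^>]*>', '', str_)

# An XML parser knows where the quoted attribute value ends.
root = etree.fromstring('<html>%s</html>' % str_)
parser_result = ''.join(root.itertext())

print(repr(regex_result))   # 'blah blah ">link END'
print(repr(parser_result))  # 'blah blah link END'
```

Keep in mind that ElementTree expects well-formed XML, so real-world HTML with an unclosed <br> or an undefined entity such as &nbsp; raises a ParseError.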

Answer 5

Try Beautiful Soup. Throw away everything except the text.
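
If installing Beautiful Soup is not an option, the same "keep only the text" idea can be sketched with the standard library's html.parser module. This swaps in a different parser from the one this answer recommends, and the names TextExtractor and strip_tags are hypothetical helpers chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never recorded.
        self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(strip_tags('blah blah <a href="blah">link</a>'))  # blah blah link
```

Note that the contents of <script> and <style> elements are also reported as text by this parser, so they would need extra handling if they appear in the input.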

Answer 6

html2text will do something like this.

Answer 7

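I just wrote this. I needed it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text. Print it, store it, feed it to your pet canary.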

import html2text


class TextFromHtml2Text:

    def __init__(self, url=''):
        if url == '':
            raise TypeError("Needs a URL")
        self.text = ""
        self.url = url  # despite the name, this is a local file path
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Read the whole file and close it afterwards.
        with open(self.url) as file:
            self.html = file.read()

    def maytheswartzbewithyou(self):
        # Convert the HTML to plain text with html2text.
        self.text = html2text.html2text(self.html)

Answer 8

There is a simple way:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

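The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you are interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)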
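
One limitation of the single boolean quote flag above is that either quote character toggles it, so an apostrophe inside a double-quoted attribute value ends the quoted state early. A possible variant, sketched here as an assumption rather than as the course's own solution, remembers which quote character opened the value. The names in_tag and quote_char are introduced for illustration.

```python
def remove_html_markup(s):
    """Strip tags, remembering which quote character opened an attribute value."""
    in_tag = False
    quote_char = None  # the quote that opened the current attribute value, if any
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the matching quote ends it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            # Ordinary text outside any tag is kept, quotes included.
            out.append(c)

    return ''.join(out)


print(remove_html_markup('blah blah <a href="blah">link</a>'))  # blah blah link
print(remove_html_markup('<a title="it\'s > here">link</a>'))   # link
```

This is still a character-level scanner, not an HTML parser, so comments, scripts, and entities are not handled.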

Answer 9

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>', re.IGNORECASE)
>>> re.sub(q, '', s)
'blah blah link'
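
One difference between this pattern and the `<[^>]*>` pattern from the earlier answer is worth a small sketch: by default `.` does not match a newline, so a tag that spans two lines is missed unless re.DOTALL is passed. The multi-line input below is a made-up example.

```python
import re

# Hypothetical input: the opening tag is split across two lines.
s = 'blah blah <a\nhref="blah">link</a>'

# '.' cannot cross the newline, so the opening tag is left in place.
lazy_dot = re.sub(r'<.*?>', '', s)

# With re.DOTALL, '.' matches newlines too.
lazy_dot_all = re.sub(r'<.*?>', '', s, flags=re.DOTALL)

# A negated character class matches newlines without any flag.
negated_class = re.sub(r'<[^>]*>', '', s)

print(repr(lazy_dot))       # 'blah blah <a\nhref="blah">link'
print(repr(lazy_dot_all))   # 'blah blah link'
print(repr(negated_class))  # 'blah blah link'
```

The re.IGNORECASE flag has no effect on either pattern, since neither contains letters.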
