Question

I have text like this:

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

Using pure Python, with no external modules, I want this:

>>> print remove_tags(text)
Title A long text..... a link

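I know I can do this with lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using only builtins or the standard library, on Python 2.6+.

How can I do that?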

Answer 1

Using a regular expression

With a regular expression, you can strip everything inside <>:

import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext
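
Stripping the tags alone leaves the newlines and extra spaces from the original markup, so the result is not yet the single line the question asks for. A minimal sketch of one way to collapse that whitespace, assuming the same regex as cleanhtml above; the names TAG_PATTERN and remove_tags_oneline are hypothetical and not part of the original answer:

```python
import re

# Same non-greedy pattern as in cleanhtml, compiled once at import time.
TAG_PATTERN = re.compile(r'<.*?>')

def remove_tags_oneline(raw_html):
    # Strip anything that looks like a tag, as cleanhtml does.
    stripped = TAG_PATTERN.sub('', raw_html)
    # split() with no arguments splits on any run of whitespace,
    # so joining with one space collapses newlines and repeated spaces.
    return ' '.join(stripped.split())

text = """<div>
<h1>Title</h1>
<p>A long text........ </p>
<a href=""> a link </a>
</div>"""

print(remove_tags_oneline(text))  # Title A long text........ a link
```

This only normalizes whitespace; it shares every other limitation of the regex approach.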

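Using BeautifulSoup

You could also use the additional package BeautifulSoup to extract all the raw text.

You need to set a parser explicitly when calling BeautifulSoup. I recommend "lxml", as mentioned in the alternative answers; it is much more robust than the default 'html.parser' (the one available without any additional install).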

from bs4 import BeautifulSoup
cleantext = BeautifulSoup(raw_html, "lxml").text

But this still means relying on an external library, so I recommend the first solution.

Answer 2

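Python has several XML modules built in. For the case where you already have a string with the full HTML, the simplest one is xml.etree, which works somewhat similarly to the lxml example you mention: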

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
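
One caveat worth illustrating: xml.etree.ElementTree is an XML parser, so it raises an error on HTML that is not well-formed XML, such as an unclosed <br>. A minimal sketch, assuming Python 2.7+ or 3.2+ (where ElementTree.ParseError exists), that falls back to a simple regex in that case. The name remove_tags_safe and the choice of fallback are illustrative assumptions, not part of the original answer:

```python
import re
import xml.etree.ElementTree as ET

# Fallback pattern, used only when the input is not well-formed XML.
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags_safe(text):
    try:
        # Works when the markup is well-formed XML.
        return ''.join(ET.fromstring(text).itertext())
    except ET.ParseError:
        # Typical real-world HTML (unclosed tags, etc.) ends up here.
        return TAG_RE.sub('', text)

print(remove_tags_safe('<p>Hello <b>world</b></p>'))    # Hello world
print(remove_tags_safe('<p>line one<br>line two</p>'))  # line oneline two
```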

Answer 3

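Note that this isn't perfect: it breaks if you have something like <a title=">">. However, it's about the closest you'll get in non-library Python without a really complex function: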

import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)
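
To make the limitation concrete, here is a small sketch that runs the same pattern on an ordinary tag and on a tag whose attribute value contains '>'. The sample strings are made up for illustration:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# An ordinary tag is removed cleanly.
print(remove_tags('<a href="x">link</a>'))   # link

# The '>' inside the attribute value ends the match early,
# so the rest of the attribute leaks into the output.
print(remove_tags('<a title=">">link</a>'))  # ">link
```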

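However, as lvc mentions, xml.etree is available in the Python standard library, so you could probably just adapt it to serve like your existing lxml version: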

import xml.etree.ElementTree

def remove_tags(text):
    return ''.join(xml.etree.ElementTree.fromstring(text).itertext())

Answer 4

There's a simple way to do this that works in any C-like language. The style is not Pythonic, but it works in pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

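The idea is based on a simple finite-state machine and is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!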
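
One thing to be aware of in remove_html_markup above: it toggles the quote flag on either quote character, so an attribute value such as title="it's" leaves the machine in the wrong state. A minimal sketch of a variant that remembers which quote character opened the value; the name remove_html_markup_strict is hypothetical, and this is an illustration of the same state-machine idea rather than the course's own code:

```python
def remove_html_markup_strict(s):
    in_tag = False
    quote_char = None  # the quote that opened the current attribute value
    out = []

    for c in s:
        if quote_char is not None:
            # Inside a quoted attribute value: only the same quote closes it.
            if c == quote_char:
                quote_char = None
        elif in_tag:
            if c == '>':
                in_tag = False
            elif c in ('"', "'"):
                quote_char = c
        elif c == '<':
            in_tag = True
        else:
            # Plain text outside any tag is kept, quotes included.
            out.append(c)

    return ''.join(out)

print(remove_html_markup_strict('<a title=">">link</a>'))         # link
print(remove_html_markup_strict('<a title="it\'s">link</a> end'))  # link end
```

Collecting characters in a list and joining once at the end also avoids rebuilding the output string on every character.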

Answer 5

s = ' '  # separator placed before each text fragment

def remove_strings(text, temp=''):
    # temp carries the text collected so far (replaces the original global).
    start = text.find('<')
    end = text.find('>', start)
    if start == -1 or end == -1:
        # No complete tag left: keep the remaining text and stop.
        return temp + text
    # Keep any text that precedes the tag.
    if text[:start] != '':
        temp += s + text[:start]
    # Skip the tag itself and continue with the rest of the string.
    return remove_strings(text[end + 1:], temp)

Python code to remove HTML tags from a string
