Question

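I want to take an HTML page and extract just the plain text from it. Does anyone know a good way to do this in Python?

I want to strip out literally everything and be left with only the text of the articles and whatever other text sits between the tags. JS, CSS, etc. should all be gone.

Thanks!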

Answer 1

The first answer won't remove the bodies of CSS or JavaScript tags when they are embedded in the page (not linked). This might get closer:

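import re

def stripTags(text):
  # DOTALL lets '.' span newlines, so multi-line <script>/<style> bodies are removed too.
  scripts = re.compile(r'<script.*?/script>', re.DOTALL | re.IGNORECASE)
  css = re.compile(r'<style.*?/style>', re.DOTALL | re.IGNORECASE)
  tags = re.compile(r'<.*?>', re.DOTALL)

  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)

  return text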

Answer 2

You could try the rather excellent Beautiful Soup:

from bs4 import BeautifulSoup  # Beautiful Soup 4 (package "beautifulsoup4")

with open("my_source.html", "r") as f:
    s = f.read()
soup = BeautifulSoup(s, "html.parser")
txt = soup.body.get_text()

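But be warned: what you get back from any parsing attempt is subject to "mistakes": bad HTML, bad parsing, and generally unexpected output. If your source documents are well known and well formed, you should be fine, or at least able to work around their quirks. If it's just general content found "out on the internet", expect all kinds of weird and wonderful outliers.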
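To make that warning concrete, here is a minimal sketch, assuming Beautiful Soup 4 with the built-in "html.parser" backend. The helper name body_text is hypothetical. With this backend a fragment gets no <body> element, so soup.body can be None and calling a method on it would fail; the sketch falls back to the whole document in that case.

```python
from bs4 import BeautifulSoup

def body_text(markup):
    """Return the text of <body>, or of the whole document if there is no <body>."""
    soup = BeautifulSoup(markup, "html.parser")
    # html.parser does not invent a <body> for fragments, so soup.body may be None.
    root = soup.body if soup.body is not None else soup
    return root.get_text()

# A fragment with unclosed tags and no <body>.
print(body_text("<p>Unclosed paragraph <b>bold text"))
# A complete document: only the body text is returned, not the title.
print(body_text("<html><head><title>t</title></head><body><p>Hello</p></body></html>"))
```

Other parser backends may repair the same input differently, so treat the fallback as one possible guard rather than a complete fix.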

Answer 3

According to here:

import re

def remove_html_tags(data):
    # Non-greedy match: remove each <...> tag, keep the text between tags.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

As he notes in the article, the re module "needs to be imported" in order to use regular expressions.
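One thing to keep in mind with a tags-only pattern: it leaves the contents of <script> and <style> elements behind. A small sketch of the difference, using only the standard re module. The function names, the combined pattern, and the sample string are my own illustration, not taken from the article.

```python
import re

TAG = re.compile(r'<.*?>', re.DOTALL)
# Matches a whole <script>...</script> or <style>...</style> element, body included.
SCRIPT_OR_STYLE = re.compile(r'<(script|style)\b.*?</\1\s*>', re.DOTALL | re.IGNORECASE)

def remove_tags_only(data):
    """Strip tags but keep everything between them, including script bodies."""
    return TAG.sub('', data)

def remove_tags_and_code(data):
    """Strip script/style elements with their bodies first, then the remaining tags."""
    return TAG.sub('', SCRIPT_OR_STYLE.sub('', data))

sample = '<p>Hello</p><script>var x = 1;</script><p>world</p>'
print(remove_tags_only(sample))      # Hellovar x = 1;world
print(remove_tags_and_code(sample))  # Helloworld
```

Note that both results run "Hello" and "world" together, since nothing is put back where the tags used to be.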

Answer 4

The lxml.html module is worth considering. However, removing CSS and JavaScript takes a little massaging:

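from lxml import html

def stripsource(page):
    source = html.fromstring(page)
    # Drop <style>, <script> and comment nodes. drop_tree() keeps the text that
    # follows each node (its tail), which getparent().remove() would discard.
    for item in source.xpath("//style|//script|//comment()"):
        if item.getparent() is not None:
            item.drop_tree()

    # Yield every remaining non-blank text fragment.
    for line in source.itertext():
        if line.strip():
            yield line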

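The yielded lines could simply be concatenated, but that might lose meaningful word boundaries if there is no whitespace around the tags that generate line breaks.

You might also want to iterate over just the <body> tag, depending on your requirements.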
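For the word-boundary concern, one option is to join the fragments with a space and then collapse repeated whitespace. This is only a sketch: join_fragments is a hypothetical helper, and the sample list stands in for whatever stripsource yields.

```python
import re

def join_fragments(fragments):
    """Join text fragments with single spaces and collapse runs of whitespace."""
    joined = " ".join(fragment.strip() for fragment in fragments)
    return re.sub(r"\s+", " ", joined).strip()

fragments = ["First paragraph.", "Second\n  paragraph.", " Third "]
print(join_fragments(fragments))  # First paragraph. Second paragraph. Third
```

The trade-off is that a space is inserted at every fragment boundary, so a word split by an inline tag such as <b> in mid-word would come out as two words.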

Answer 5

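I'd recommend BeautifulSoup as well, but I'd suggest using something like the answer to this question, which I'll copy here for those who don't want to look there: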

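from bs4 import BeautifulSoup
from bs4.element import Comment

# `html` is assumed to hold the page source as a string.
soup = BeautifulSoup(html, "html.parser")
texts = soup.find_all(string=True)

def visible(element):
    # Skip text that lives inside containers the browser does not render.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments.
    elif isinstance(element, Comment):
        return False
    return True

visible_texts = list(filter(visible, texts))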

I tried it on this page and it worked quite well.

Answer 6

This is the cleanest, simplest solution I've found for stripping CSS and JavaScript:

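from bs4 import BeautifulSoup

# Join every text node whose parent is neither <script> nor <style>.
''.join(BeautifulSoup(content, "html.parser").find_all(string=lambda text:
    text.parent.name != "script" and
    text.parent.name != "style"))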

https://stackoverflow.com/a/3002599/1203188

Matthew Flaschen
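A variation on the same idea, in case it helps: rather than filtering text nodes, remove the <script> and <style> elements from the tree and then ask for the remaining text. This is a sketch assuming Beautiful Soup 4 with the "html.parser" backend. The function name and the sample page are made up for illustration, and it is not the code from the linked answer.

```python
from bs4 import BeautifulSoup

def text_without_scripts(content):
    """Return the page text with <script> and <style> elements removed."""
    soup = BeautifulSoup(content, "html.parser")
    # Calling the soup like a function is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the element and everything inside it
    # separator=" " keeps words from adjacent elements apart;
    # strip=True trims each fragment and skips blank ones.
    return soup.get_text(separator=" ", strip=True)

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>")
print(text_without_scripts(page))  # Hello world
```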

What is the best way to remove everything but text from a web page?
