Question

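Is there a pure-Python tool that takes some HTML and truncates it as close as possible to a given length, while making sure the resulting snippet is well-formed? For example, given this HTML: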

<h1>This is a header</h1>
<p>This is a paragraph</p>

it would not produce:

<h1>This is a hea

but:

<h1>This is a header</h1>

or at least:

<h1>This is a hea</h1>

I couldn't find one that works, though I did find one that relies on pullparser, which is both obsolete and dead.

Answer 1

If you're using the Django library, you can simply write:

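from django.utils import text, html


class class_name():

    def trim_string(self, stringf, limit, offset=0):
        # Plain slice of the string; not HTML-aware.
        return stringf[offset:limit]

    def trim_html_words(self, html, limit, offset=0):
        # truncate_html_words is only available in older Django releases.
        return text.truncate_html_words(html, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strips every tag; `tag`, `limit` and `offset` are unused here.
        return html.strip_tags(htmls)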

Anyway, here is the code for truncate_html_words from Django:

import re

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out

Answer 2

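I don't think you need a full-fledged parser. You only need to tokenize the input string into one of the following:

text
open tag
close tag
self-closing tag
character entity

Once you have a stream of tokens like that, it's easy to use a stack to keep track of which tags need closing. I actually ran into this problem a while ago and wrote a small library to do this: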

https://github.com/eentzel/htmltruncate.py

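It works fairly well for me and handles most corner cases, including arbitrarily nested markup, counting character entities as a single character, and returning an error on malformed markup.

It will produce:

<h1>This is a hea</h1>

on your example. This could perhaps be changed, but it's hard in the general case: what if you are trying to truncate to 10 characters, but the <h1> tag isn't closed for another, say, 300 characters?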
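
To illustrate the token-stream-plus-stack idea described in this answer, here is a minimal sketch. It is not the code of the linked library: it assumes the standard library's html.parser as the tokenizer, and the names truncate_html, _TruncatingParser and VOID_TAGS are made up for this example. It counts text characters only, counts each character entity as one character, and silently skips stray closing tags instead of reporting an error.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class _TruncatingParser(HTMLParser):
    """Hypothetical helper: copies tokens until `limit` text characters are used."""

    def __init__(self, limit):
        # convert_charrefs=False keeps entities as separate tokens.
        super().__init__(convert_charrefs=False)
        self.remaining = max(limit, 0)  # text characters still allowed
        self.out = []                   # output tokens collected so far
        self.stack = []                 # currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        if self.remaining > 0:
            self.out.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        if self.remaining > 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.remaining > 0 and tag in self.stack:
            # Close intervening unclosed tags, then the matching one.
            while self.stack:
                open_tag = self.stack.pop()
                self.out.append("</%s>" % open_tag)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        if self.remaining > 0:
            piece = data[:self.remaining]
            self.out.append(piece)
            self.remaining -= len(piece)

    def _entity(self, token):
        # A character entity counts as a single character.
        if self.remaining > 0:
            self.out.append(token)
            self.remaining -= 1

    def handle_entityref(self, name):
        self._entity("&%s;" % name)

    def handle_charref(self, name):
        self._entity("&#%s;" % name)


def truncate_html(markup, limit):
    """Truncate `markup` to `limit` text characters and close any open tags."""
    parser = _TruncatingParser(limit)
    parser.feed(markup)
    parser.close()
    closers = "".join("</%s>" % tag for tag in reversed(parser.stack))
    return "".join(parser.out) + closers
```

Calling truncate_html on the question's example with a limit of 13 returns <h1>This is a hea</h1>. Comments and doctype declarations are dropped, because the sketch does not override those handlers.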

Answer 3

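You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters, rather than at a number of content characters):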

from BeautifulSoup import BeautifulSoup

def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length]))

Answer 4

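I found slacy's answer very helpful and would upvote it if I had the reputation, but there is one extra thing to note. In my environment I had html5lib installed as well as BeautifulSoup4. BeautifulSoup used the html5lib parser, which resulted in my HTML snippet being wrapped in html and body tags, which is not what I wanted.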

>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<html><head></head><body><p>s</p></body></html>'

To resolve this, I told BeautifulSoup to use the Python parser:

from bs4 import BeautifulSoup
def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length], "html.parser"))

>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<p>s</p>'

Answer 5

This will serve your requirement. It is an easy-to-use HTML parser and bad-markup corrector:

http://www.crummy.com/software/BeautifulSoup/

Answer 6

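My initial thought would be to use an XML parser (maybe Python's SAX parser), and then count the text characters in each XML element. I would leave the tag characters out of the count to keep it more consistent and simpler, but either is possible.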

Answer 7

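I'd recommend first completely parsing the HTML and then truncating. A great HTML parser for Python is lxml. After parsing and truncating, you can print it back out as HTML.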
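
A rough sketch of the parse-then-truncate idea follows. It swaps in the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages, which means it only accepts well-formed, XML-like markup (named HTML entities such as &nbsp; or unclosed tags would make parsing fail). The function name truncate_parsed and the dummy root wrapper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def truncate_parsed(fragment, limit):
    """Parse `fragment`, keep the first `limit` text characters, serialize back."""
    # Wrap the fragment in a dummy root so several top-level elements parse.
    root = ET.fromstring("<root>%s</root>" % fragment)
    remaining = max(limit, 0)

    def take(text):
        # Return as much of `text` as the remaining budget allows.
        nonlocal remaining
        if not text:
            return text
        piece = text[:remaining]
        remaining -= len(piece)
        return piece

    def walk(elem):
        elem.text = take(elem.text)
        for child in list(elem):
            if remaining <= 0:
                # Budget used up: drop every element that starts after the cut.
                elem.remove(child)
                continue
            walk(child)
            child.tail = take(child.tail)

    walk(root)
    # Serialize the children of the dummy root (each child carries its tail).
    body = "".join(ET.tostring(child, encoding="unicode") for child in root)
    return (root.text or "") + body
```

Because the tree is complete before anything is cut, every element that survives is serialized with its closing tag, so the result stays well-formed. Whitespace between elements counts toward the limit in this sketch.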

Answer 8

Look at HTML Tidy to clean up, reformat, or reindent HTML.

Truncating HTML in Python
