Question

I have a very long HTML text with the following structure:

<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>

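Now, let's say I want to trim the HTML text to just 1000 characters, but I still want the HTML to be valid, that is, any tag whose closing tag was cut off should be closed again. How can I use Python to fix up the trimmed HTML text? Note that the structure of the HTML is not always as shown above.

I need this for an email campaign in which a blog preview is sent out, but recipients have to visit the blog's URL to see the full article.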

Answer 1

How about BeautifulSoup? (python-bs4)

from bs4 import BeautifulSoup

test_html = """<div>
    <div>
        <p>Paragraph 1 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 2 Lorem ipsum dolor... long text... </p>
        <p>Paragraph 3 Lorem ipsum dolor... long text... </p>
    </div>
</div>"""

test_html = test_html[0:50]
soup = BeautifulSoup(test_html, 'html.parser')

print(soup.prettify())

.prettify() should close the tags automatically.
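One caveat with slicing the raw string: the cut counts markup as well as text, and it can land in the middle of a tag. As a rough alternative, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. It counts text characters only and closes whatever is still open once the limit is reached. The names truncate_html, _Truncator and VOID_TAGS are made up for this illustration, and comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (hypothetical helper constant).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Copies markup through until `limit` text characters were emitted."""

    def __init__(self, limit):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.remaining = limit
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; nothing to close later.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only honour end tags that match the innermost open element.
        if not self.done and self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        self.out.append(data)

    def _emit_ref(self, ref):
        # An entity counts as one visible character.
        if self.done:
            return
        if self.remaining < 1:
            self.done = True
            return
        self.remaining -= 1
        self.out.append(ref)

    def handle_entityref(self, name):
        self._emit_ref(f"&{name};")

    def handle_charref(self, name):
        self._emit_ref(f"&#{name};")


def truncate_html(html, limit):
    """Return `html` cut to `limit` text characters with open tags closed."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{t}>" for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


html = "<div><p>Hello world</p><p>Second paragraph</p></div>"
print(truncate_html(html, 5))  # <div><p>Hello</p></div>
```

Whitespace between tags counts toward the limit here, so for indented HTML you may want a somewhat larger limit than the number of visible characters you are after.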

Answer 2

I can give an example. Suppose the HTML looks like this:

<div>
  <p>Long text...</p>
  <p>Longer text to be trimmed</p>
</div>

and you have this Python code:

def TrimHTML(HtmlString):
    result = []
    newlinesremaining = 2  # number of leading lines to keep; pick any value
    foundlastpart = False
    for x in HtmlString:  # HtmlString is the HTML to be trimmed
        if newlinesremaining >= 1:
            # still inside the part to keep: copy the character through
            if x == '\n':
                newlinesremaining -= 1
            result.append(x)
        elif not foundlastpart:
            # skipping the line to drop; resume copying once it ends
            if x == '\n':
                newlinesremaining = float('inf')
                foundlastpart = True
    return ''.join(result)

If you pass the sample HTML above to that function, it returns:

<div>
  <p>Long text...</p>
</div>

For some odd reason I can't test it in the short time I have before work.
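For what it's worth, the same line-based idea can be sketched more compactly with str.splitlines. This is only an illustration: drop_lines and its parameters keep_first and drop are hypothetical names, and like the function above it does not parse the HTML at all, so it only gives balanced output when every element to be removed sits on its own line.

```python
def drop_lines(html, keep_first=2, drop=1):
    """Keep the first `keep_first` lines, skip the next `drop`, keep the rest."""
    lines = html.splitlines()
    kept = lines[:keep_first] + lines[keep_first + drop:]
    return "\n".join(kept)


sample = (
    "<div>\n"
    "  <p>Long text...</p>\n"
    "  <p>Longer text to be trimmed</p>\n"
    "</div>"
)
print(drop_lines(sample))
```

With the sample above this keeps the opening div and the first paragraph, drops the second paragraph, and keeps the closing div.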

Trimming part of an HTML text with Python