Question

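I'm writing a blog app with Django. I want to let comment authors use a few tags (such as <strong>, <a> and so on) but disable all others.

In addition, I want to let them put code inside <code> tags and have pygments parse it.

For example, someone might write this comment: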

I like this article, but the third code example <em>could have been simpler</em>:

<code lang="c">
#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code>

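The problem is that when I parse the comment with BeautifulSoup to strip the disallowed HTML tags, it also parses the inside of the <code> blocks and treats <stdbool.h> and <stdio.h> as if they were HTML tags.

How can I tell BeautifulSoup not to parse the <code> blocks? Or is there another HTML parser better suited to this job?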

Answer 1

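The problem is that <code> is handled according to the normal rules for HTML markup, so the content inside a <code> tag is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a markup language that is very similar, but not identical, to HTML. The simple solution is to assume certain rules, such as "<code> and </code> must each appear on a line by themselves", and do some preprocessing yourself.

One very simple, though not 100% reliable, technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It is not fully reliable because if the code block contains ]]>, things will go badly wrong.
A safer option is to replace the dangerous characters inside the code block (<, > and &) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each code block you identify to an escaping function such as cgi.escape().

Once the preprocessing is done, submit the result to BeautifulSoup as usual.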
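A minimal sketch of that preprocessing step, assuming the "tags on their own line" rule from this answer. The function name escape_code_lines is made up for illustration, and html.escape stands in for cgi.escape, which was removed from the standard library in Python 3.8.

```python
import html
import re

# Assumed rule: the opening tag sits alone on its line, optionally with
# attributes such as lang="c"; the closing tag is a bare </code> line.
OPEN_LINE = re.compile(r"<code(\s[^>]*)?>")


def escape_code_lines(comment):
    """Escape <, > and & on every line between a lone <code> and </code>."""
    out = []
    in_code = False
    for line in comment.split("\n"):
        stripped = line.strip()
        if not in_code and OPEN_LINE.fullmatch(stripped):
            in_code = True          # keep the opening tag untouched
        elif in_code and stripped == "</code>":
            in_code = False         # keep the closing tag untouched
        elif in_code:
            line = html.escape(line, quote=False)
        out.append(line)
    return "\n".join(out)
```

The returned string can then be handed to BeautifulSoup, which will see only entities inside the code block. Text outside the block is left alone, so tags like <em> are still parsed as tags.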

Answer 2

From the Python wiki:

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)

Answer 3

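Unfortunately, there is no way to stop BeautifulSoup from parsing the code blocks.

One way to achieve what you want is to:

1) Remove the code blocks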

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')

2) Do your normal parsing to strip the disallowed tags.

3) Re-insert the code blocks and regenerate the HTML.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code

I would have answered with more code, but I recently read a blog post that does this elegantly:

http://iboris.com/page/add-source-code-syntax-highlighting-your-django-content-pygments.html
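The three steps above can be sketched with the standard library alone. Everything named here (clean_comment, strip_disallowed, the ALLOWED set) is hypothetical; strip_disallowed is only a regex stand-in for whatever BeautifulSoup-based cleaner you already have, and the html.escape call marks the spot where a Pygments call would go.

```python
import html
import re

CODE_RE = re.compile(r"<code(?:\s[^>]*)?>(.*?)</code>", re.DOTALL)
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")
ALLOWED = {"strong", "em", "a", "code"}  # assumed whitelist
PLACEHOLDER = '<code class="removed"></code>'


def strip_disallowed(markup):
    """Stand-in sanitizer: drop every tag whose name is not in ALLOWED."""
    def keep_or_drop(match):
        return match.group(0) if match.group(1).lower() in ALLOWED else ""
    return TAG_RE.sub(keep_or_drop, markup)


def clean_comment(comment, sanitize=strip_disallowed):
    blocks = []

    def stash(match):
        blocks.append(match.group(1))   # remember the raw code text
        return PLACEHOLDER

    # 1) remove the code blocks, leaving a placeholder for each
    without_code = CODE_RE.sub(stash, comment)
    # 2) strip disallowed tags from everything else
    cleaned = sanitize(without_code)
    # 3) put the code back, escaped (or highlighted) instead of parsed
    remaining = iter(blocks)

    def restore(match):
        return "<code>" + html.escape(next(remaining, ""), quote=False) + "</code>"

    return re.sub(re.escape(PLACEHOLDER), restore, cleaned)
```

Note that this sketch drops the attributes of the original <code> tag (such as lang="c"); if the highlighter needs them, they would have to be stashed alongside the code text.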

Answer 4

Edit:

Use python-markdown2 to process the input, and have users indent their code regions.

>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>

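If you still need to navigate and edit with BeautifulSoup, do the following. Include the entity conversion if you need the '<' and '>' to be reinserted (instead of '&lt;' and '&gt;').

>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)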
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>
<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>


import cgi

def thickened(soup):
    """
    <code>
    blah blah <entity> blah
        blah
    </code>
    """
    codez = soup.findAll('code') # get the code tags
    for code in codez:
        # take all the contents inside of the code tags and convert
        # them into a single string
        escape_me = ''.join([k.__str__() for k in code.contents])
        escaped = cgi.escape(escape_me) # escape them with cgi
        code.replaceWith('<code>%s</code>' % escaped) # replace Tag objects with escaped string
    return soup

Answer 5

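If the <code> element contains unescaped <, & or > characters in the code, then it is not valid HTML. BeautifulSoup will try to convert it into valid HTML, which is probably not what you want.

To convert the text to valid HTML, you can adapt a regex that strips tags from an HTML document so that it extracts the text from the <code> blocks and replaces it with its cgi.escape() version. It should work fine as long as there are no nested <code> tags. After that, you can feed the sanitized HTML to BeautifulSoup.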
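One way that regex approach might look, as a sketch only: the pattern and the name escape_code_blocks are illustrative, html.escape replaces cgi.escape (removed in Python 3.8), and nested <code> tags are assumed not to occur.

```python
import html
import re

# Opening tag (with optional attributes), body, closing tag.
# The lazy .*? stops at the first </code>, so nesting is not supported.
CODE_BLOCK = re.compile(
    r"(<code(?:\s[^>]*)?>)(.*?)(</code>)",
    re.DOTALL | re.IGNORECASE,
)


def escape_code_blocks(markup):
    """Escape <, > and & inside each <code> block; leave the rest alone."""
    def escape_body(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + html.escape(body, quote=False) + close_tag
    return CODE_BLOCK.sub(escape_body, markup)
```

Unlike a line-based approach, this keeps the opening tag's attributes and also handles code blocks written inline. Running it twice on the same text would double-escape the code, so it should be applied exactly once, before parsing.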

Parsing a document with BeautifulSoup without parsing the contents of <code> tags
