Question

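This seems like it should be simple, but I have not been able to find an answer. I am using Pandoc to convert HTML to Markdown, and I would like to strip all attributes, such as "class" and "id", from the HTML.

Does Pandoc have an option for this?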

Answer 1

Consider input.html:

<h1 class="test">Hi!</h1>
<p><strong id="another">This is a test.</strong></p>

Then pandoc input.html -t markdown_github-raw_html -o output.md

produces output.md:

Hi!
===

**This is a test.**

Without -t markdown_github-raw_html, you would get

Hi! {#hi .test}
===

**This is a test.**

This question is actually similar to this one. I don't think pandoc preserves id attributes.

Answer 2

You can use a Lua filter to remove all attributes and classes. Save the following to a file named remove-attr.lua and call pandoc with --lua-filter=remove-attr.lua.

function remove_attr (x)
  if x.attr then
    x.attr = pandoc.Attr()
    return x
  end
end

return {{Inline = remove_attr, Block = remove_attr}}
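If the conversion runs as part of a script, the pandoc call with the filter could be wrapped roughly as follows. This is a minimal sketch and not part of the original answer: the helper names `build_pandoc_command` and `convert` are hypothetical, and it assumes that `pandoc` is on PATH, that remove-attr.lua sits in the working directory, and that plain `markdown` is the desired output format.

```python
import shutil
import subprocess


def build_pandoc_command(input_path, output_path, lua_filter="remove-attr.lua"):
    """Return the pandoc argument list for an HTML-to-Markdown conversion.

    The filter path default is an assumption; point it at wherever the
    Lua filter was saved.
    """
    return [
        "pandoc",
        input_path,
        "-f", "html",
        "-t", "markdown",
        f"--lua-filter={lua_filter}",
        "-o", output_path,
    ]


def convert(input_path, output_path, lua_filter="remove-attr.lua"):
    """Run pandoc; raises CalledProcessError if pandoc exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    command = build_pandoc_command(input_path, output_path, lua_filter)
    subprocess.run(command, check=True)
```

Calling `convert("input.html", "output.md")` would then run the same command line as described above. Building the argument list separately keeps it easy to inspect without invoking pandoc.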

Answer 3

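I was also surprised that a web search for this seemingly simple operation turned up nothing. In the end I wrote the following by consulting the BeautifulSoup documentation and example usage in other SO answers.

The code below also removes script and style HTML tags. On top of that, it preserves any src and href attributes. Between the two, it should be flexible enough to adapt to your needs (that is, adjust it as required, then use pandoc to convert the returned HTML to Markdown).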

# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString

def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()
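Where installing BeautifulSoup is not an option, roughly the same clean-up can be sketched with the standard library's `html.parser` alone. This is a swapped-in technique rather than the code above, and the names `AttrStripper`, `strip_attributes`, `KEEP_ATTRS`, and `DROP_TAGS` are hypothetical. Unlike the BeautifulSoup version, it does not pretty-print, and it drops comments and doctype declarations as a side effect.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"src", "href"}      # attributes to preserve
DROP_TAGS = {"script", "style"}   # elements removed together with their content


class AttrStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except those in KEEP_ATTRS."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _render(self, tag, attrs, closing=">"):
        parts = []
        for key, value in attrs:
            if key in KEEP_ATTRS:
                # The parser hands over unescaped values, so escape them again.
                parts.append(' {}="{}"'.format(key, escape(value or "", quote=True)))
        return "<" + tag + "".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS and not self.skip_depth:
            self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&" + name + ";")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#" + name + ";")


def strip_attributes(html):
    """Return html with attributes stripped and script/style elements removed."""
    parser = AttrStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The returned string can be written to a file or piped to pandoc in the same way as the output of `unstyle_html`.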

Pandoc - HTML to Markdown - remove all attributes

3 answers: