Question

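I am using Python 3.5. In some cases, when I call tidylib.tidy_document on an HTML file, the '/' character at the end of the <link ../> tag in the header is removed. Tidylib does not report any error or warning saying that it removed this character.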

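The HTML files I am working with are part of an Epub generated with writer2epub. The error occurs in almost all files in this Epub. The only exceptions are very short files (for example, the title page). The error is the same in all affected files.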

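I suspected the problem was the use of carriage returns (0x0d) instead of line feeds (0x0a), but changing them made no difference. I also see that the file contains various other non-ASCII characters, so maybe they are to blame. Searching for Unicode problems with tidylib did not turn up anything related to this issue.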
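For anyone who wants to repeat the check described above, here is a minimal sketch of how the line endings and non-ASCII characters in a file could be counted. The function name `summarize_bytes` and the assumption that the file is UTF-8 encoded are mine, not something taken from the test file; it uses only the standard library and does not involve tidylib.

```python
from collections import Counter


def summarize_bytes(data, encoding="utf-8"):
    """Count line-ending bytes and non-ASCII characters in raw file content.

    `encoding` is an assumption; undecodable bytes are replaced, not raised.
    """
    text = data.decode(encoding, errors="replace")
    # Tally every character outside the 7-bit ASCII range.
    non_ascii = Counter(ch for ch in text if ord(ch) > 127)
    return {
        "cr": data.count(b"\r"),        # carriage returns (0x0d), including those in CRLF
        "lf": data.count(b"\n"),        # line feeds (0x0a), including those in CRLF
        "crlf": data.count(b"\r\n"),    # Windows-style pairs
        "non_ascii": dict(non_ascii),
    }


if __name__ == "__main__":
    # Hypothetical sample content, used only to show the call.
    sample = "<p>caf\u00e9</p>\r\n<p>na\u00efve</p>\n".encode("utf-8")
    print(summarize_bytes(sample))
```

To run it on a real file, read the file in binary mode (`open(fname, 'rb')`) so that Python does not translate the line endings before they are counted.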

I uploaded a test file; the problem can be reproduced with the following code:

import re
from tidylib import tidy_document



def printLink(html):
    """ Print the <link> tag from the HTML header """
    for line in html.split('\n'):
        match = re.search('<link[^>]+>', line)
        if match is not None:
            print(match.group(0))



if __name__ == '__main__':
    fname = 'test04.xhtml'
    print(fname)
    with open(fname, 'r') as fh:
        html = fh.read()

    print('checkpoint 01')
    printLink(html)
    newHtml, errors = tidy_document(html)
    print('checkpoint 02')
    printLink(newHtml)

If the problem is reproduced, the output will be:

<link rel="stylesheet" href="../styles/style001.css" type="text/css"/>

at checkpoint 01 and

<link rel="stylesheet" href="../styles/style001.css" type="text/css">

at checkpoint 02.
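The comparison between the two checkpoints can also be automated rather than read off by eye. The sketch below is one possible way to do it: it finds every <link> tag in a "before" and an "after" string and reports the ones that lost their self-closing '/'. The helper names `link_tags` and `lost_self_closing` are hypothetical, and the pairing assumes the tags keep their order in the output. It works on plain strings, so it makes no assumption about how tidylib itself behaves.

```python
import re

# Matches a whole <link ...> tag, even if it spans several lines.
LINK_RE = re.compile(r"<link\b[^>]*>", re.IGNORECASE)


def link_tags(html):
    """Return a list of (tag, is_self_closed) for every <link> tag in html."""
    tags = []
    for match in LINK_RE.finditer(html):
        tag = match.group(0)
        # Drop the final '>' and any whitespace, then look for a trailing '/'.
        is_self_closed = tag[:-1].rstrip().endswith("/")
        tags.append((tag, is_self_closed))
    return tags


def lost_self_closing(before, after):
    """Return (old_tag, new_tag) pairs where the trailing '/' disappeared.

    Tags are paired by position, assuming their order is preserved.
    """
    lost = []
    for (old, old_closed), (new, new_closed) in zip(link_tags(before), link_tags(after)):
        if old_closed and not new_closed:
            lost.append((old, new))
    return lost


if __name__ == "__main__":
    before = '<link rel="stylesheet" href="../styles/style001.css" type="text/css"/>'
    after = '<link rel="stylesheet" href="../styles/style001.css" type="text/css">'
    for old, new in lost_self_closing(before, after):
        print("was:", old)
        print("now:", new)
```

In the reproduction script, `lost_self_closing(html, newHtml)` could be called after `tidy_document` to list the affected tags across all files of the Epub.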

What is causing tidylib to remove this '/' character?

Is tidylib corrupting my HTML files?

0 answers: