Question

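I'm using some software that raises an error when it tries to create a PDF from HTML containing non-ASCII characters. I wrote a simpler program to reproduce the problem and to help me understand what is going on.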

#!/usr/bin/python
#coding=utf8
from __future__ import unicode_literals
import pdfkit
from pyPdf import PdfFileWriter, PdfFileReader
f = open('test.html','r')
html = f.read()
print html
pdfkit.from_string(html, 'gen.pdf')
f.close()

Running this program produces:

<html>
<body>
<h1>ر</h1>
</body>
</html>

Traceback (most recent call last):
  File "./testerror.py", line 10, in <module>
    pdfkit.from_string(html, 'gen.pdf')
  File "/usr/local/lib/python2.7/dist-packages/pdfkit/api.py", line 72, in from_string
    return r.to_pdf(output_path)
  File "/usr/local/lib/python2.7/dist-packages/pdfkit/pdfkit.py", line 136, in to_pdf
    input = self.source.to_s().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)

I tried adding a replace statement to strip out the problem character, but that also raised an error:

Traceback (most recent call last):
  File "./testerror.py", line 9, in <module>
    html = html.replace('ر','-')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 18: ordinal not in range(128)

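I'm afraid I don't understand ASCII / UTF-8 encoding very well. If someone could help me understand what is happening here, that would be great! I'm not sure whether this is a problem in the PDF library or a result of my own ignorance about encodings :)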

Answer 1

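Reading the pdfkit source code, it seems that pdfkit.from_string expects its first argument to be unicode, not str, so it is up to you to decode html properly. To do that you must know the encoding of your test.html file. Once you know it, proceed like this:

with open('test.html') as f:
    html = f.read().decode('<your-encoding-name-here>')
pdfkit.from_string(html, 'gen.pdf')

Note that str.decode(<encoding>) returns a unicode string and unicode.encode(<encoding>) returns a byte string. In other words, you decode from a byte string to unicode, and you encode from unicode to a byte string. This is also why both tracebacks report a UnicodeDecodeError: in Python 2, calling .encode('utf-8') on a str, or mixing a str with a unicode literal in replace(), first decodes the bytes implicitly with the ascii codec, which fails on byte 0xd8.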
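
To make the decode/encode direction concrete, here is a minimal sketch built from the bytes of the question's test.html. It assumes the file was saved as UTF-8 (the 0xd8 byte at position 18 is consistent with that, but it is still an assumption), and it is written so that it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# The bytes of the question's test.html, ASSUMING the file is saved as UTF-8.
raw = b"<html>\n<body>\n<h1>\xd8\xb1</h1>\n</body>\n</html>\n"

# Decoding with the ascii codec fails on the first non-ASCII byte.
# Python 2 performs this same decode implicitly when a byte string is
# encoded or mixed with unicode text.
try:
    raw.decode("ascii")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start  # index of the offending byte (0xd8)

# decode: byte string -> unicode text
text = raw.decode("utf-8")

# encode: unicode text -> byte string (the round trip restores the bytes)
round_trip = text.encode("utf-8")

print(error_position)     # 18
print(round_trip == raw)  # True
```

The reported position matches the one in the question's traceback, which suggests the failure really is the implicit ascii decode of the raw file contents.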

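In your case you can also use codecs.open(path, mode, encoding) instead of open() plus explicit decoding, i.e.:

import codecs
with codecs.open('test.html', encoding='<your-encoding-name-here>') as f:
    html = f.read()  # codecs will do the decoding behind the scenes

As a side note:

- Reading (binary read for codecs, but that's an implementation detail) is the default mode when opening a file, so there is no need to specify it at all.
- Using files as context managers (with open(path) as f: ...) ensures the file is properly closed. While CPython will usually close opened files when the file objects are collected, this is an implementation detail that is not guaranteed by the language, so do not rely on it.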
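
The two approaches can be compared side by side. The sketch below writes a temporary file (a stand-in for test.html, assumed to be UTF-8) and reads it back both ways. The helper names are made up for illustration, and the file is opened with "rb" in the manual variant so the same code also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Stand-in for the question's test.html content (U+0631 is the Arabic letter).
SAMPLE = u"<html>\n<body>\n<h1>\u0631</h1>\n</body>\n</html>\n"


def read_with_manual_decode(path, encoding):
    """Hypothetical helper: read raw bytes, then decode them explicitly."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


def read_with_codecs(path, encoding):
    """Hypothetical helper: let codecs.open decode while reading."""
    with codecs.open(path, encoding=encoding) as f:
        return f.read()


def demo():
    # Write the sample as UTF-8 bytes to a temporary file, then read it back.
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(SAMPLE.encode("utf-8"))
        return (read_with_manual_decode(path, "utf-8"),
                read_with_codecs(path, "utf-8"))
    finally:
        os.remove(path)


manual, via_codecs = demo()
print(manual == via_codecs == SAMPLE)
```

Either result is unicode text that can be handed to pdfkit.from_string. On Python 3 the built-in open() accepts an encoding argument directly, so codecs.open is mainly useful on Python 2.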

Answer 2

The HTML should also declare its charset:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
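
If the HTML is assembled in Python, the charset declaration can be added before the text is passed to pdfkit. This is only a sketch: wrap_with_charset and render are hypothetical helpers, not part of pdfkit, and the pdfkit import is kept inside render so the rest of the block runs without pdfkit installed.

```python
# -*- coding: utf-8 -*-
def wrap_with_charset(body_html, charset="utf-8"):
    """Hypothetical helper: wrap a body fragment in a full HTML document
    whose head declares the given charset."""
    template = (
        u"<html>\n<head>\n"
        u'<meta http-equiv="Content-Type" content="text/html; charset={0}">\n'
        u"</head>\n<body>\n{1}\n</body>\n</html>\n"
    )
    return template.format(charset, body_html)


def render(html_text, output_path="gen.pdf"):
    """Hypothetical wrapper around the pdfkit.from_string call from the
    question; requires pdfkit and wkhtmltopdf to be installed."""
    import pdfkit
    return pdfkit.from_string(html_text, output_path)


# Build the document as unicode text; render(html) would produce the PDF.
html = wrap_with_charset(u"<h1>\u0631</h1>")
print(u"charset=utf-8" in html)
```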

Unicode decode error when trying to generate a PDF with non-ASCII characters
