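I am using the invoice2data library to parse invoices. The library ships with predefined YAML templates for parsing invoices. However, when I run the example, I get a YAML parsing error for all of the templates.
I run it as: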
invoice2data --input-reader tesseract FlipkartInvoice.pdf
Exception:
Traceback (most recent call last):
  File "/home/webwerks/.local/bin/invoice2data", line 10, in <module>
    sys.exit(main())
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/main.py", line 191, in main
    templates += read_templates()
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/extract/loader.py", line 88, in read_templates
    tpl = ordered_load(template_file.read())
  File "/home/webwerks/.local/lib/python3.5/site-packages/invoice2data/extract/loader.py", line 36, in ordered_load
    return yaml.load(stream, OrderedLoader)
  File "/usr/local/lib/python3.5/dist-packages/yaml/__init__.py", line 112, in load
    loader = Loader(stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/loader.py", line 44, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 74, in __init__
    self.check_printable(stream)
  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0082: special characters are not allowed
  in "<unicode string>", position 312
The last lines say:
  File "/usr/local/lib/python3.5/dist-packages/yaml/reader.py", line 144, in check_printable
    'unicode', "special characters are not allowed")
yaml.reader.ReaderError: unacceptable character #x0082: special characters are not allowed
  in "<unicode string>", position 312
I have checked the templates. All of them are valid UTF-8.
The problem seems to be related to the python-yaml package. Has anyone run into this issue?
Answer 0 (score: 3)
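That your input is valid UTF-8 is irrelevant, because a YAML source is only supposed to contain a subset of the Unicode code points (independent of whether it is encoded as UTF-8 or otherwise).
In particular, it only supports the printable subset of Unicode, and the old YAML 1.1 specification that PyYAML supports spells this out: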
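"The allowed character range explicitly excludes the surrogate block #xD800-#xDFFF, DEL #x7F, the C0 control block #x0-#x1F (except for #x9, #xA, and #xD), the C1 control block #x80-#x9F, #xFFFE, and #xFFFF. Any such characters must be presented using escape sequences."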
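As a rough illustration of that rule, the sketch below checks a decoded string against the printable set using only the standard library. The pattern follows the `c-printable` production of the YAML 1.1 grammar, which, as far as I can tell, also admits NEL (#x85) even though the prose summary lists the whole C1 block. The function name `find_unacceptable` and the sample string are hypothetical, not part of PyYAML or invoice2data.

```python
import re

# Anything outside this class is "non-printable" in the YAML 1.1 sense:
# TAB, LF, CR, printable ASCII, NEL, and the BMP/astral ranges minus
# C0/C1 controls, DEL, surrogates, U+FFFE and U+FFFF.
NON_PRINTABLE = re.compile(
    r"[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_unacceptable(text):
    """Return (offset, 'U+XXXX') pairs for each character YAML would reject."""
    return [
        (match.start(), "U+%04X" % ord(match.group()))
        for match in NON_PRINTABLE.finditer(text)
    ]


# Hypothetical template fragment containing a stray U+0082.
sample = "issuer: Example\x82 Ltd\n"
print(find_unacceptable(sample))  # [(15, 'U+0082')]
```

The string is perfectly encodable as UTF-8 (U+0082 becomes the two bytes C2 82), which is why a UTF-8 validity check on the template passes while the YAML reader still refuses it.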
So the non-printable "BREAK PERMITTED HERE" code point 0x0082 is clearly not allowed
(and PyYAML rejecting it is correct behavior, not one of its bugs).
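To find the offending character in a template, the "position 312" from the error message can be turned into a line and column. The helper below is a minimal sketch under the assumption that the reported position is a zero-based character offset into the decoded text (not a byte offset), so the file should be read with `encoding="utf-8"` before calling it. The name `describe_position` and the sample text are hypothetical.

```python
def describe_position(text, position, width=20):
    """Map a character offset to (line, column, code point, escaped context)."""
    # Lines and columns are reported 1-based, as most editors show them.
    line = text.count("\n", 0, position) + 1
    line_start = text.rfind("\n", 0, position) + 1
    column = position - line_start + 1
    code_point = "U+%04X" % ord(text[position])
    # ascii() escapes non-ASCII characters so the stray one becomes visible.
    context = ascii(text[max(0, position - width):position + width])
    return line, column, code_point, context


# Hypothetical two-line template with a stray U+0082 on the second line.
# For a real file: text = open(path, encoding="utf-8").read()
text = "a: 1\nb: x\x82y\n"
print(describe_position(text, 9))
```

Once the character is located, it can be deleted if it is stray debris, or written as an escape sequence inside a double-quoted scalar if it is actually meant to be there.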