Question

I'm using lxml's html5parser. ASCII characters are fine, but if I download an HTML file that contains Persian and Russian characters, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 418: ordinal not in range(128)

Here is the response text: http://paste.ubuntu.com/23552349/

My code just strips all invalid XML characters from the response.

If I remove the line resp = encode("utf-8"), I get this error instead:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

Answer 1

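I also ran into some odd inconsistencies when using html5parser directly (things like TypeError: __init__() got an unexpected keyword argument 'useChardet').

If you already have lxml installed, the BeautifulSoup wrapper is a pleasure to use.

First install BeautifulSoup (pip install beautifulsoup4). Then: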

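import requests
from bs4 import BeautifulSoup

# fetch_doc is a placeholder name; the original snippet is the body of a function
def fetch_doc(headers, cookies, data):
    # headers, cookies and data are initialized by the caller
    f = requests.post('http://www.example.com/getHtml.php?', headers=headers, cookies=cookies, data=data)
    resp = f.text  # response body decoded to str by requests
    if not resp:
        return []
    # parse with lxml as the underlying tree builder
    doc = BeautifulSoup(resp, 'lxml')
    return doc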

You can then use BeautifulSoup's clean API to manipulate the HTML tree. Under the hood, it still uses lxml for parsing.

See the BeautifulSoup API reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Answer 2

resp = ''.join(c for c in resp if valid_xml_char_ordinal(c))

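This attempt to filter out bad characters doesn't work, because the control characters in the input are actually encoded as numeric character references, not as raw characters: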

<td class="artistFlux">السيف النشيد الدولة الإسلامية التي من شأن&#16</td>

Specifically, the &#16 (obscured here by the right-to-left text). Control characters such as U+0010 (decimal 16) are invalid in HTML5, even as character references.
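To see why the character-level filter misses this, here is a small sketch. The definition of valid_xml_char_ordinal below is an assumption, since the question's own definition is not shown; it follows the character ranges that XML 1.0 allows. The sample string is made up for illustration.

```python
def valid_xml_char_ordinal(c):
    # Assumed definition: the character ranges allowed by XML 1.0.
    codepoint = ord(c)
    return (
        0x20 <= codepoint <= 0xD7FF
        or codepoint in (0x9, 0xA, 0xD)
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )

# A raw U+0010 control character, followed by the same character
# written as a numeric character reference.
resp = 'raw:\x10 ref:&#16</td>'

filtered = ''.join(c for c in resp if valid_xml_char_ordinal(c))
print(filtered)  # raw: ref:&#16</td>
```

The raw control character is dropped, but the reference survives because '&', '#', '1' and '6' are each perfectly valid characters on their own.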

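It would be best to fix the upstream script that produces this junk, but if you have to remove such bum character references from the input, you could add another filter that removes matches of a regex like &#(3[01]|2[0-9]|1[124-9]|[0-8])(?=[^0-9]).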
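A minimal sketch of such a filter, assuming the input is already a str. The function name strip_bad_char_refs is hypothetical. The pattern is a variation on the one above: it also swallows an optional trailing ";" and uses a negative lookahead so that a reference at the very end of the string is caught too. It only covers decimal references without leading zeros; hexadecimal forms such as &#x10; would need a separate pattern.

```python
import re

# Decimal references to control characters 0-8, 11, 12 and 14-31.
# Tab (9), line feed (10) and carriage return (13) are left alone.
BAD_CHAR_REF = re.compile(r'&#(?:3[01]|2[0-9]|1[124-9]|[0-8])(?![0-9]);?')

def strip_bad_char_refs(html):
    # Hypothetical helper: delete every matching reference from the markup.
    return BAD_CHAR_REF.sub('', html)

print(strip_bad_char_refs('<td>text&#16</td>'))    # <td>text</td>
print(strip_bad_char_refs('a&#10;b&#160;c&#16;d')) # a&#10;b&#160;cd
```

The lookahead is what keeps longer references such as &#160; or &#1575; intact: without it, the pattern would match their first one or two digits and corrupt them.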

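By the way, you don't normally need the encode/decode round trip. You can read the raw bytes of the response from f.content and feed them directly to html5parser, which avoids decoding the response to text and then re-encoding it to bytes. You may also need fragments_fromstring (plural), since your input has two top-level elements.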
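The following sketch shows why the round trip adds nothing, using a made-up byte string as a stand-in for f.content. It also reproduces the kind of error reported in the question by decoding non-ASCII bytes as ASCII; that this is what happens inside the question's code is an inference, not something the question confirms. The helper name ascii_decode_error is hypothetical.

```python
# Stand-in for f.content: UTF-8 bytes as they would arrive over the wire.
raw = '<p>سلام</p><p>Привет</p>'.encode('utf-8')

# Decoding to text and re-encoding yields the very same bytes.
round_tripped = raw.decode('utf-8').encode('utf-8')
print(round_tripped == raw)  # True

def ascii_decode_error(data):
    # Hypothetical helper: return the error message raised when the
    # bytes are decoded as ASCII, or None if they are pure ASCII.
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None

print(ascii_decode_error(raw))
print(ascii_decode_error(b'<p>plain ascii</p>'))  # None
```

Pure ASCII input passes through any of these conversions unharmed, which matches the observation that only the Persian and Russian pages fail.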

lxml.html5parser: doesn't work with Arabic/Persian HTML5 documents
