I'm trying to POST unicode data using httplib's request function:
s = u"עברית"
data = """
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s
con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=he", data)
response = con.getresponse().read()
However, this is the error I get:
Traceback (most recent call last):
  File "C:\Scripts\iQuality\test.py", line 47, in <module>
    print spellFix(u"╫á╫נ╫¿╫ץ╫ר╫ץ")
  File "C:\Scripts\iQuality\test.py", line 26, in spellFix
    con.request("POST", "/tbproxy/spell?lang=%s" % lang, data)
  File "C:\Python27\lib\httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\lib\httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 815, in _send_output
    self.send(message_body)
  File "C:\Python27\lib\httplib.py", line 787, in send
    self.sock.sendall(data)
  File "C:\Python27\lib\ssl.py", line 220, in sendall
    v = self.send(data[count:])
  File "C:\Python27\lib\ssl.py", line 189, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 97-102: ordinal not in range(128)
What am I doing wrong?
Answer 0 (score: 9)
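HTTP is not defined in terms of a particular character encoding; it works with octets. You need to encode your data into octets yourself, and then tell the server which encoding you used. Let's use UTF-8, since it is usually the best choice.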
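This data looks a bit like XML, but you are skipping the XML declaration. Some services might accept that, but you shouldn't rely on it. In fact, the encoding belongs in that declaration, so be sure to include it. The declaration looks like <?xml version="1.0" encoding="ENCODING"?>, with ENCODING replaced by the name of the encoding you used.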
s = u"עברית"
data_unicode = u"""<?xml version="1.0" encoding="UTF-8"?>
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s
data_octets = data_unicode.encode('utf-8')
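To make the encoding step concrete, here is a minimal sketch (the variable names are illustrative and not from the question; it is written so that it also runs on Python 3). It shows that the five-character unicode string becomes ten octets in UTF-8, and that encoding it as ASCII, which is what Python 2 attempts implicitly when a unicode object reaches the socket, fails with the same kind of error as in the traceback.

```python
# -*- coding: utf-8 -*-
# Sketch only: the names below are illustrative.
s = u"עברית"

# Explicit encoding turns the unicode string into octets.
octets = s.encode('utf-8')
char_count = len(s)        # number of code points
octet_count = len(octets)  # number of bytes after encoding

# ASCII cannot represent Hebrew letters, so this attempt raises.
try:
    s.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print("%d characters, %d octets" % (char_count, octet_count))
print(type(ascii_error).__name__)
```

Each Hebrew letter takes two octets in UTF-8, which is why the octet count is double the character count.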
As a courtesy, you should also tell the server the format and encoding of the body with the content-type header:
con = httplib.HTTPSConnection("www.google.com")
con.request("POST",
            "/tbproxy/spell?lang=he",
            data_octets,
            {'content-type': 'text/xml; charset=utf-8'})
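The body and the header have to agree on the encoding, so it can help to build them in one place. The following is a rough sketch under some assumptions: `build_spell_request` and `TEMPLATE` are hypothetical names, not part of httplib, and the sketch uses `xml.sax.saxutils.escape` from the standard library to escape the word. It makes no network call; it only prepares what would be passed as the body and headers arguments of `con.request`.

```python
# -*- coding: utf-8 -*-
from xml.sax.saxutils import escape

# Hypothetical helper, not part of httplib: it builds the octets and the
# headers that would be passed to con.request("POST", url, body, headers).
TEMPLATE = u"""<?xml version="1.0" encoding="UTF-8"?>
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
"""


def build_spell_request(word):
    # Escape &, < and > so the word cannot break the XML, then encode
    # the whole document with the encoding named in the declaration.
    body = (TEMPLATE % escape(word)).encode('utf-8')
    # The charset here must match the encoding used for the body.
    headers = {'content-type': 'text/xml; charset=utf-8'}
    return body, headers


body, headers = build_spell_request(u"עברית & <b>")
print(headers['content-type'])
```

Keeping the `.encode('utf-8')` call and the `charset=utf-8` value next to each other makes it harder for the two to drift apart.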
Edit: It works fine on my machine; are you sure you didn't skip something? Full example:
>>> from cgi import escape
>>> from urllib import urlencode
>>> import httplib
>>>
>>> template = u"""<?xml version="1.0" encoding="UTF-8"?>
... <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
... <text>%s</text>
... </spellrequest>
... """
>>>
>>> def chkspell(word, lang='en'):
...     data_octets = (template % escape(word)).encode('utf-8')
...     con = httplib.HTTPSConnection("www.google.com")
...     con.request("POST",
...                 "/tbproxy/spell?" + urlencode({'lang': lang}),
...                 data_octets,
...                 {'content-type': 'text/xml; charset=utf-8'})
...     req = con.getresponse()
...     return req.read()
...
>>> chkspell('baseball')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="8"></spellresult>'
>>> corpus = u"עברית"
>>> chkspell(corpus, 'he')
'<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="5"></spellresult>'
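If you want to inspect the reply rather than just print it, a minimal sketch with `xml.etree.ElementTree` from the standard library might look like the following. The response string is copied from the session above; the meaning of the individual attributes is an open question here, so the code only reads them back.

```python
import xml.etree.ElementTree as ET

# Response octets copied from the session above.
response = (b'<?xml version="1.0" encoding="UTF-8"?>'
            b'<spellresult error="0" clipped="0" charschecked="5">'
            b'</spellresult>')

# The parser honours the encoding named in the XML declaration.
root = ET.fromstring(response)

# Attribute values come back as strings, so convert where needed.
chars_checked = int(root.get('charschecked'))
children = list(root)  # child elements; this reply has none

print(root.tag)
```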
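I did notice that when I paste your example, it appears on my terminal in the opposite order from how it shows in my browser. That is not too surprising, considering that Hebrew is a right-to-left language.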
>>> corpus = u"עברית"
>>> print corpus[0]
ע
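The string itself is stored in logical order (the order in which it is read), whatever the terminal or browser does when displaying it. As a small sketch, the standard `unicodedata` module can confirm which letter comes first and that every character is marked as right-to-left; the variable names other than `corpus` are illustrative.

```python
# -*- coding: utf-8 -*-
import unicodedata

corpus = u"עברית"

# Indexing follows logical (storage) order, not display order.
first = corpus[0]
first_name = unicodedata.name(first)

# Bidirectional class 'R' marks strong right-to-left characters.
bidi_classes = [unicodedata.bidirectional(ch) for ch in corpus]

print(first_name)
print(bidi_classes)
```

So a reversed appearance on screen is a rendering difference only; the octets sent to the server are the same either way.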