Question

import csv

def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv','wb') as csvfile:

        csvwriter = csv.DictWriter(csvfile,fieldnames = ordered_fieldnames,extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow( {k: str(x[k]).encode('utf-8') for k in x})

        #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]


if __name__ == '__main__':
    main()

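Basically, the problem is that the tweets contain a bunch of Portuguese characters. I tried to correct for this by encoding every value before putting it into the dictionary that gets written as a row. However, this does not work. Any other ideas for formatting these values so that csv reader and DictReader can read them?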

Answer 1

str() is the problem.

str() converts Unicode strings to byte strings using the default ascii codec in Python 2:

>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
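The failure comes from the ascii codec itself rather than from anything specific to str(), which can be checked by calling the codec explicitly. A minimal sketch, with illustrative variable names that are not part of the question's code; it should behave the same on Python 2 and Python 3:

```python
# Illustrative names; not part of the original question's code.
text = u'r\xe9sum\xe9'  # the word "resume" with two e-acute characters

try:
    text.encode('ascii')   # the encode that Python 2's str(text) performs implicitly
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc      # the ascii codec rejects u'\xe9' at position 1

utf8_bytes = text.encode('utf-8')  # an explicit UTF-8 encode succeeds
```

Naming the codec explicitly is what avoids the implicit ascii step.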

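Non-Unicode values (such as booleans) are converted to byte strings, but Python then implicitly decodes the byte string to a Unicode string before calling .encode(), because only Unicode strings can be encoded. This usually does not cause an error because most non-Unicode objects have an ASCII representation. Here is an example where a custom object returns a non-ASCII str() representation:

>>> class Test(object):
...     def __str__(self):
...         return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x = Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Note that the above is a decode error, not an encode error.

If str() is only there to coerce booleans to strings, coerce to a Unicode string instead:

unicode(x[k]).encode('utf-8')

Non-Unicode values are converted to Unicode strings, which can then be encoded correctly, while Unicode strings are left unchanged, so they are encoded correctly too.

>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'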
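The coercion described here can be wrapped in a small helper and applied to a whole row. This is only a sketch: to_utf8, text_type, and the sample tweet dictionary are hypothetical names introduced for illustration, and the version check is an assumption added so the same lines run on Python 2 and Python 3.

```python
import sys

# Pick the text type for the running interpreter:
# unicode on Python 2, str on Python 3 (the unused branch is never evaluated).
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_utf8(value):
    """Coerce any value to text first, then encode that text as UTF-8 bytes."""
    return text_type(value).encode('utf-8')


# Hypothetical tweet document standing in for one item of result_cursor.
tweet = {'user': u'jo\xe3o', 'retweeted': True, 'retweet_count': 3}
encoded_row = {k: to_utf8(v) for k, v in tweet.items()}
```

On Python 2, a row built this way could be passed to csvwriter.writerow() in place of the str(x[k]).encode('utf-8') comprehension from the question.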


P.S. Python 3 does not implicitly encode or decode between byte strings and Unicode strings, which makes these errors easier to spot.
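To illustrate the Python 3 behavior, here is a rough sketch that runs on Python 3 only (the question's own code is Python 2). The rows list is a hypothetical stand-in for the MongoDB cursor, and an in-memory buffer replaces the real file so the example is self-contained.

```python
import csv
import io

# Python 3 refuses to mix bytes and text instead of silently using the ascii codec.
try:
    b'r\xc3\xa9sum\xc3\xa9' + u'!'
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

# Hypothetical stand-in for the documents returned by db.tweets.find().
rows = [
    {'user': u'jo\xe3o', 'text': u'\xf4nibus atrasado', 'retweeted': True},
    {'user': u'ana', 'text': u'tudo bem', 'retweeted': False},
]

# An in-memory text buffer keeps the sketch self-contained; a real file would be
# opened with open(path, 'w', encoding='utf-8', newline='').
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), extrasaction='ignore')
writer.writeheader()
for row in rows:
    writer.writerow(row)  # non-string values such as True are written via str()

csv_text = buffer.getvalue()
read_back = list(csv.DictReader(io.StringIO(csv_text, newline='')))
```

In this setup the csv module works with text throughout, and the UTF-8 encoding is applied once, by the file object, when the data is written out.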

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)
