Question

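There are several threads about this on Stack Overflow, but I couldn't find a working solution to the whole problem.

I collected a large amount of text data using urllib's read function and stored it in pickle files.

Now I want to write this data to a file. While writing, I get errors similar to: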

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

and a large amount of data is lost.

I think the data read by urllib is byte data.

I have tried:

   1. text=text.decode('ascii','ignore')
   2. s=filter(lambda x: x in string.printable, s)
   3. text=u''+text
      text=text.decode().encode('utf-8')

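But I still end up with similar errors. Can someone point out a proper solution? Would stripping with codecs also work? I have no problem if the conflicting bytes are simply not written to the file, so that loss is acceptable.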

Answer 1

You can do this with smart_str from Django's encoding utilities. Try this:

from django.utils.encoding import smart_str, smart_unicode

text = u'\u2019'
print smart_str(text)

You can install Django by opening a command shell with administrator privileges and running this command:

pip install Django

Answer 2

Your data is unicode data. To write it to a file, use .encode():

text = text.encode('ascii', 'ignore')

But this removes anything that isn't ASCII. Perhaps you want to encode to a more suitable encoding, such as UTF-8, instead?
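To make the difference concrete, here is a minimal sketch comparing the two encodings and writing the text out as UTF-8. The sample string and the output path are made up for illustration, and using io.open with an encoding argument is my assumption about what fits your setup (it behaves the same on Python 2 and 3):

```python
import io
import os
import tempfile

# Sample text containing U+2019 (right single quotation mark),
# the character from the error message. The string itself is made up.
text = u"It\u2019s a test"

# Lossy: characters outside ASCII are silently dropped.
ascii_bytes = text.encode("ascii", "ignore")

# Lossless: every character is kept, encoded as UTF-8.
utf8_bytes = text.encode("utf-8")

# io.open takes an encoding and encodes on write, so unicode text
# can be written directly. The path here is a hypothetical location.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With the ASCII variant the apostrophe disappears from the output; with UTF-8 nothing is lost, provided whatever reads the file later also decodes it as UTF-8.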

You may want to read up on Python and Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Python Unicode HOWTO
Pragmatic Unicode
