Question

I am trying to open a Wikipedia database dump file in Python 3. I decompressed the file on Linux with gzip and tried to open it with the following code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

with open('dump.sql', 'r') as file:
    for i in file:
        print(i)

But it gives me this error:

  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 250-251: invalid continuation byte

The Linux command file -i dump.sql reports a UTF-8 charset. What could be the problem?
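To narrow this down, a minimal sketch like the one below could locate the offending bytes. It reads the file in binary mode and tries to decode each line as UTF-8. The helper name find_invalid_utf8 and its limit parameter are hypothetical, introduced only for illustration. It also rests on an assumption: file -i may base its guess on only part of the file, so a utf-8 report would not guarantee that every byte decodes.

```python
def find_invalid_utf8(path, limit=5):
    """Return up to `limit` tuples of (line_number, byte_offset_in_line, bad_bytes).

    Hypothetical helper: only the first undecodable sequence of each
    line is reported, because decoding stops at the first error.
    """
    problems = []
    # Binary mode avoids the implicit UTF-8 decoding that raised the error.
    with open(path, "rb") as fh:
        for line_number, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start and exc.end delimit the bytes that failed to decode.
                problems.append((line_number, exc.start, raw[exc.start:exc.end]))
                if len(problems) >= limit:
                    break
    return problems
```

Calling find_invalid_utf8('dump.sql') would return an empty list if every line decodes cleanly. Splitting on newlines in binary mode is safe here because the newline byte never occurs inside a multi-byte UTF-8 sequence.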

I found more information here, but this file is from 4.7.2017, so that should not be the problem.

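Because of lenient charset validation in earlier MediaWiki versions (around 2004), the dumps may contain non-Unicode (UTF8) characters in older text revisions. For instance, zhwiki-20130102-langlinks.sql.gz contained some copied and pasted iso8859-1 "ö" characters; as the langlinks table is generated on parsing, a null edit or forcelinkupdate to the page was enough to fix it.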

So how can I process Wikipedia database dump files in Python?
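One possible direction, offered only as a sketch and not as a confirmed fix: pass an explicit encoding and an error handler to open(), so that stray non-UTF-8 bytes are substituted instead of raising. The function name iter_dump_lines is hypothetical, and the choice of errors="replace" is an assumption about what is acceptable for the data, since it silently turns each bad byte into U+FFFD.

```python
def iter_dump_lines(path, errors="replace"):
    """Yield decoded lines from a dump, tolerating invalid UTF-8 bytes.

    Hypothetical helper: with errors="replace", each undecodable byte
    becomes U+FFFD (the Unicode replacement character).
    """
    with open(path, "r", encoding="utf-8", errors=errors) as fh:
        for line in fh:
            yield line
```

A loop such as for line in iter_dump_lines('dump.sql'): print(line) would then run through the whole file. If the original bytes need to be recoverable later, errors="surrogateescape" is an alternative handler, at the cost of strings that cannot be encoded back to strict UTF-8 without the same handler.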

Wikipedia database dump - UTF8 charset

0 answers: