Question

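I am writing a program to iterate over my Robocopy log (> 25 MB). It is far from finished, because I have run into a problem.

The problem is that after iterating over ~1700 lines of my log, I get a "UnicodeError":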

Traceback (most recent call last):
  File "C:/Users/xxxxxx.xxxxxx/SkyDrive/#Python/del_robo2.py", line 6, in <module>
    for line in data:
  File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7869: character maps to <undefined>

The program is as follows:

x="Error"
y=1
arry = []
data = open("Ausstellungen.txt",mode="r")
for line in data:
    arry = line.split("\t")
    print(y)
    y=y+1
    if x in arry:
        print("found")
        print(line)
data.close()

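If I reduce the txt file to 1000 lines, the program works. If I delete lines 1500 to 3000 and run it again, I get the Unicode error again around line 1700.

So am I making a mistake, or is this a memory limit problem in Python?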

Answer 1

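Given your data and snippet, I would be surprised if this were a memory problem. It is more likely the encoding: Python reads the file with your system's default encoding, which is "cp1252" (the default MS Windows encoding), but the file contains bytes or byte sequences that cannot be decoded in that encoding. A candidate for the file's actual encoding might be "latin-1", which you can use in Python 3 by saying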

open("Ausstellungen.txt",mode="r", encoding="latin-1")

A possibly similar question is "Python 3 chokes on CP-1252/ANSI reading". A very good talk about the whole thing is here: http://nedbatchelder.com/text/unipain.html
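
To illustrate the point about cp1252 versus latin-1, here is a minimal sketch. The sample bytes are made up for the demonstration and are not taken from the actual log; only the byte value 0x81 comes from the traceback in the question.

```python
# Byte 0x81 is one of the few values that Python's cp1252 codec leaves undefined.
raw = b"Gr\x81n"  # hypothetical sample bytes, not from the real log

cp1252_error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    cp1252_error = exc
    print("cp1252 failed:", exc.reason)

# latin-1 maps every byte 0x00-0xFF to a code point, so decoding never raises.
# That does not mean the resulting characters are the ones the writer intended.
text = raw.decode("latin-1")
print(repr(text))
```

Because latin-1 accepts any byte, a successful read only shows that the error is gone, not that the text is correct.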

Answer 2

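Python decodes all file data read in text mode to Unicode values. You did not specify an encoding to use, so Python uses your system's default, the cp1252 Windows Latin codepage.

However, that is the wrong encoding for your file data. You need to specify an explicit codec to use: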

data = open("Ausstellungen.txt",mode="r", encoding='UTF8')

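Unfortunately, you will need to figure out for yourself which encoding to use. I used UTF-8 only as an example codec.

Note that some versions of RoboCopy have problems producing valid output.

If you do not yet know what Unicode is, or want to learn about encodings, see: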
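
One way to narrow down the encoding is to try a few candidates on the raw bytes. The sketch below is only an aid under stated assumptions: the function name `decodable_with`, the candidate list, and the sample line are all hypothetical, and cp850 is included purely as a guess because it is a common console codepage on Western European Windows systems.

```python
def decodable_with(raw, candidates=("utf-8", "cp1252", "cp850", "latin-1")):
    """Return the candidate encodings that decode `raw` without an error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot handle at least one byte
        ok.append(name)
    return ok

# Hypothetical log line containing the byte 0x81 from the traceback.
sample = b"Gr\x81n\tError\r\n"
print(decodable_with(sample))
print(repr(sample.decode("cp850")))
```

A codec that decodes without raising is only a candidate. You still have to look at the decoded text and check that names and umlauts come out as expected.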

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Python Unicode HOWTO
Pragmatic Unicode

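The reason you see the error crop up in different parts of the file is that your data contains more than one byte that the cp1252 codec cannot decode.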
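
To see where those bytes are, you could read the file in binary mode and test each line separately. This is a rough sketch: `undecodable_lines` is a hypothetical helper and the sample data is invented. With the real log you would pass in a file opened as open("Ausstellungen.txt", mode="rb") instead of the list.

```python
def undecodable_lines(lines, encoding="cp1252"):
    """Yield (line_number, first_offending_byte) for lines that fail to decode."""
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte the codec rejected
            yield number, raw[exc.start]

# Hypothetical stand-in for the lines of a file opened in binary mode.
sample = [b"ok\n", b"Gr\x81n\n", b"fine\n", b"\x8fx\n"]
for number, byte in undecodable_lines(sample):
    print(number, hex(byte))
```

This reports every affected line rather than stopping at the first one, which matches the observation that deleting part of the file only moves the error.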

Python 3: Why do I get a UnicodeDecodeError, or is this a memory problem?
