Question

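I'm not having much success working with a 2GB XML file on Windows 10 64-bit. I am using some code I found on GitHub here and managed to get it running, but I get UnicodeErrors on one particular character, \u0126, which is Ħ (a letter used in the Maltese alphabet). The script executes, but after the first chunk has been saved and the second chunk starts, the error comes up.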

Edit: The XML file is a Disqus dump from a local portal.

I have followed the advice in this SO question and set chcp 65001 and setx PYTHONIOENCODING utf-8 in the Windows command prompt, and checked both with the echo command.
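For context, PYTHONIOENCODING only governs sys.stdin, sys.stdout and sys.stderr, and chcp only changes the console code page. Neither changes the encoding that open() falls back to when no encoding= argument is given; that comes from the locale (unless Python's UTF-8 mode is enabled). A small check along these lines shows which encodings are in effect. It is a sketch, not part of the original script, and report_encodings is a made-up helper name:

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python uses for the console and for files."""
    return {
        # Influenced by PYTHONIOENCODING
        "stdout": getattr(sys.stdout, "encoding", None),
        # What open() uses when no encoding= argument is passed
        "open_default": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, enc in report_encodings().items():
        print(f"{name}: {enc}")
```

If open_default reports cp1252 while stdout reports utf-8, files opened without an explicit encoding will still be written as cp1252.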

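I have tried many of the solutions found under "Questions that may already have your answer", but I still get the UnicodeError on the same letter. I also tried a crude data.replace('Ħ', 'H') and data.replace('\\u1026', 'H'), but the error still comes up, and in the same position. Every test of something new takes about 5 minutes until the error appears, and I have been struggling with this nuisance for over a day.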

I tried reading the file in Notepad++ 64-bit, but the program stops responding when I run a search, as my 16GB of RAM gets eaten up and the system becomes sluggish.

I had to change the following parts of the code. The first line became:

cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt', encoding='utf-8')

and the second line became:

with open(filename, 'rt', encoding='utf-8') as xml_file:

but still no luck. I also used errors='replace' and errors='ignore', but to no avail.

cur_file = open(os.path.join(out_dir, root + FMT % cur_idx + ext), 'wt')

with open(filename, 'rt') as xml_file:
    while True:
        # Read a chunk
        chunk = xml_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            # End of file
            # tell the parser we're done
            p.Parse(chunk, 1)
            # exit the loop
            break
        # process the chunk
        p.Parse(chunk)

# Don't forget to close our handle
cur_file.close()
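One way to take decoding out of the read side entirely is to open the input in binary mode and let expat decode the bytes itself, following the XML declaration. expat also copes with a multi-byte character that is split across two chunks. The following is a minimal sketch of that idea, not the original script: parse_in_chunks, the tiny CHUNK_SIZE and the sample document are all made up for illustration.

```python
import io
import xml.parsers.expat

CHUNK_SIZE = 16  # deliberately tiny for the demo; a real run would use far more


def parse_in_chunks(stream, on_text, chunk_size=CHUNK_SIZE):
    """Feed a binary stream to expat chunk by chunk."""
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = on_text
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of file: tell the parser we're done
            parser.Parse(b"", True)
            break
        parser.Parse(chunk, False)


# Hypothetical sample containing both problem characters from the question
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?><root>Ħal \u200e Qormi</root>'
).encode("utf-8")

pieces = []
parse_in_chunks(io.BytesIO(xml_bytes), pieces.append)
text = "".join(pieces)
print(ascii(text))
```

With a real file, the stream would come from open(filename, 'rb'). The handlers still receive str objects, so the write side needs its own explicit encoding.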

Another line I had to edit from the original code was cur_file.write(data.encode('utf-8')), which I had to change to:

cur_file.write(data)  # .encode('utf-8')) #*

otherwise execution stops with TypeError: write() argument must be str, not bytes

def char_data(data):
    """ Called by the parser when it meets data """
    global cur_size, start
    wroteStart = False
    if start is not None:
        # The data belongs to an element, we should write the start part first
        cur_file.write('<%s%s>' % (start[0], attrs_s(start[1])))
        start = None
        wroteStart = True
    # ``escape`` is too much for us, only & and < need to be escaped there ...
    data = data.replace('&', '&amp;')
    data = data.replace('<', '&lt;')
    if data == '>':
        data = '&gt;'
    cur_file.write(data.encode('utf-8')) #*
    cur_size += len(data)
    if not wroteStart:
        # The data was outside of an element, it could be the right moment to
        # make the split
        next_file()

Any help is greatly appreciated.

Edit: Added the traceback. The problem always arises when attempting to write the file.

Traceback (most recent call last):
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 249, in <module>
    main(args[0], options.output_dir)
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 229, in main
    p.Parse(chunk)
  File "..\Modules\pyexpat.c", line 282, in CharacterData
  File "D:/Users/myself/ProjectForTesting/xml_split.py", line 180, in char_data
    cur_file.write(data)  # .encode('utf-8'))
  File "C:\Users\myself\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 6: character maps to <undefined>
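The encodings\cp1252.py frame in the traceback indicates that the file being written to at that moment was using the Windows default codec rather than UTF-8. Neither Ħ (U+0126) nor the left-to-right mark (U+200E) exists in cp1252, which can be confirmed in isolation with a small sketch like this one (can_encode and the sample string are made up for illustration):

```python
sample = "\u0126 and \u200e"  # Ħ plus a left-to-right mark


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("cp1252:", can_encode(sample, "cp1252"))
print("utf-8: ", can_encode(sample, "utf-8"))
```

This reproduces the failure without the 5-minute wait on the full file.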

Edit: I have tried replacing the offending character in Notepad++, but another one, '\u200e', has cropped up, so replacing characters is not robust at all.

Answer 1

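I have been such a noob. I amended the write-to-file command to use a try: except block, which simply replaces any data containing an unwanted character with an empty string. I know the file will lose some information this way, but at least I can split it and look inside!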

This is what I did:

try:
    cur_file.write(data)  # .encode('utf-8')) # this was part of the original line
except UnicodeEncodeError:
    data = ''
    cur_file.write(data)
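Note that the try/except above discards the whole data string whenever a single character in it cannot be encoded. A less lossy option may be to make sure every place that opens an output file passes encoding='utf-8'. This rests on an assumption: the next_file() call in the question suggests the script opens a fresh output file for each chunk, and if that open() call has no encoding, it would explain why the first chunk succeeds and the second fails. The sketch below shows the idea with a made-up helper, open_chunk_file, and a temporary directory standing in for the real output directory:

```python
import os
import tempfile


def open_chunk_file(out_dir, name):
    """Open an output chunk for text writing as UTF-8, whatever the locale is."""
    return open(os.path.join(out_dir, name), "wt", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Write text containing both problem characters from the question
    with open_chunk_file(tmp, "chunk_0001.xml") as f:
        f.write("<p>\u0126\u200e</p>")
    # Read it back with the same encoding to confirm nothing was lost
    with open(os.path.join(tmp, "chunk_0001.xml"), "rt", encoding="utf-8") as f:
        round_trip = f.read()

print(ascii(round_trip))
```

Routing every output open() through one helper like this keeps the encoding consistent between the first chunk and all later ones.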

Error splitting a large 2GB XML file - UnicodeErrors: 'charmap' codec ... character maps to <undefined>
