Question

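I am splitting a 4 GB Wiktionary XML data dump into smaller files, with no overlap, using Python to process it and save the individual pages (...).

The same information, once split across separate files, balloons to 18+ GB.

Why does this happen? Is there a way to avoid it?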

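import os

OUT_DIR = 'WIKTIONARY_WORDS_DUMP'
os.makedirs(OUT_DIR, exist_ok=True)  # portable replacement for calling mkdir

# English Wiktionary (which nonetheless contains many foreign words!)
with open('enwiktionary-20151020-pages-articles.xml', 'r', encoding='utf-8') as f:
    word_file = None  # output file for the page currently being copied
    number = 1
    for l in f:
        if '<page>' in l:
            # 'w' rather than 'a': rerunning the script must not append to old output
            path = os.path.join(OUT_DIR, str(number) + '.xml')
            word_file = open(path, 'w', encoding='utf-8')
            word_file.write(l)
            number += 1
        elif '</page>' in l and word_file is not None:
            word_file.write(l)
            word_file.close()
            word_file = None
        elif word_file is not None:
            word_file.write(l)

    # Close the last file if the dump ended in the middle of a page
    if word_file is not None:
        word_file.close()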

Answer 1

Are the smaller files also saved as XML, with the same tag hierarchy? If so, you will necessarily be repeating tags.

That is, if you split this file:

<root>
    <item>abc</item>
    <item>def</item>
    <item>ghi</item>
</root>

into three separate files:

<root>
    <item>abc</item>
</root>

<root>
    <item>def</item>
</root>

<root>
    <item>ghi</item>
</root>

the <root> tag is repeated in each of the smaller files.
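To put a number on the repeated-tag overhead, here is a minimal sketch that splits the three-item example above and compares byte counts. The helper name `split_items` is hypothetical, and the input is written without indentation so that only the tags themselves are counted; real dumps will differ.

```python
import xml.etree.ElementTree as ET

# The three-item example, written without whitespace so only tags are counted
original = "<root><item>abc</item><item>def</item><item>ghi</item></root>"

def split_items(xml_text):
    """Return one standalone XML document per child of the root element."""
    root = ET.fromstring(xml_text)
    pieces = []
    for child in root:
        wrapper = ET.Element(root.tag)  # the wrapper tag repeated in every piece
        wrapper.append(child)
        pieces.append(ET.tostring(wrapper, encoding="unicode"))
    return pieces

pieces = split_items(original)
original_size = len(original.encode("utf-8"))
split_size = sum(len(p.encode("utf-8")) for p in pieces)

print(original_size, split_size)  # 61 87
```

Each extra file costs one more `<root>` and `</root>` pair, 13 bytes here, so the overhead grows with the number of files and with the size of whatever wrapper each file repeats.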

If your data schema is more complex, it gets worse:

<root>

Splitting a file greatly increases its size
