I have a directory containing 1,000 files. Each file has many lines, where each line is an ngram varying from 4 to 8 bytes. I am trying to parse all of the files to get the distinct ngrams as a header row, and then, for each file, write a row with the frequencies at which those ngram sequences occur in that file.
The following code gets through collecting the headers, but runs into a memory error when it tries to write the headers out to the csv file. I am running it on an Amazon EC2 instance with 30GB of RAM. Can anyone suggest an optimization I'm not aware of?
import collections
import csv
import os

# Note: a combination of a list and a set is used to maintain the order of the
# metadata headers but still get set performance, since the non-meta headers
# do not need to maintain order
header_list = []
header_set = set()
header_list.extend(META_LIST)
for ngram_dir in NGRAM_DIRS:
    ngram_files = os.listdir(ngram_dir)
    for ngram_file in ngram_files:
        with open(ngram_dir + ngram_file, 'r') as file:
            for line in file:
                if '.' not in line and line.rstrip('\n') not in IGNORE_LIST:
                    header_set.add(line.rstrip('\n'))
header_list.extend(header_set)  # MEMORY ERROR OCCURRED HERE

outfile = open(MODEL_DIR + MODEL_FILE_NAME, 'w')
csvwriter = csv.writer(outfile)
csvwriter.writerow(header_list)

# Convert ngram representations to a vector model of frequencies
for ngram_dir in NGRAM_DIRS:
    ngram_files = os.listdir(ngram_dir)
    for ngram_file in ngram_files:
        with open(ngram_dir + ngram_file, 'r') as file:
            write_list = []
            linecount = 0
            header_dict = collections.OrderedDict.fromkeys(header_set, 0)
            while linecount < META_FIELDS:  # META_FIELDS = 3
                line = file.readline()
                write_list.append(line.rstrip('\n'))
                linecount += 1
            file_counter = collections.Counter(line.rstrip('\n') for line in file)
            header_dict.update(file_counter)
            for value in header_dict.itervalues():
                write_list.append(value)
            csvwriter.writerow(write_list)
outfile.close()
Answer 0 (score: 0)
Don't extend the list, then. Use chain from itertools to chain the list and the set together.
Instead of:
header_list.extend(header_set)  # MEMORY ERROR OCCURRED HERE
do this (assuming csvwriter.writerow accepts any iterator):
import itertools

headers = itertools.chain(header_list, header_set)
...
csvwriter.writerow(headers)
That should at least avoid the memory problem you are currently seeing.
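For illustration, here is a minimal self-contained sketch of the chained header write. The META_LIST values, header_set contents, and the model.csv path below are made up for the example; in the asker's code the chained object would simply be passed to the existing csvwriter.

import csv
import itertools

# Hypothetical stand-ins for the question's META_LIST and header_set
META_LIST = ['file_id', 'label', 'source']
header_set = {'abcd', 'wxyzq', 'qrstuvwx'}

# Assumes Python 3, where csv.writer.writerow accepts any iterable
with open('model.csv', 'w', newline='') as outfile:
    csvwriter = csv.writer(outfile)
    # chain() yields the meta headers and then the ngram headers lazily,
    # so the combined header row is never materialized as one big list
    csvwriter.writerow(itertools.chain(META_LIST, header_set))

The same trick would apply to each data row, e.g. csvwriter.writerow(itertools.chain(write_list, header_dict.values())). Note that the answer's "assuming" matters: Python 3's writerow accepts any iterable, while Python 2's (which the itervalues call in the question implies) expects a real sequence, so this sketch assumes Python 3.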