So I decided to implement merge sort in Python 3 to handle large CSV files (I'm working with a 5GB file >.<), and I believe my logic is correct. The problem is that it's slow, and I was just wondering whether you have any suggestions on how to change my code for faster performance? Thanks, and please bear with my code, I'm still new to Python ^^
This is the main part of the merge sort code. Note that it runs after the file has already been split into chunks and each chunk has been sorted:
def merge_sort():
    files_to_merge = os.listdir(temp_folder)
    files_left = len(files_to_merge)
    print("Merging {} files...".format(files_left))
    temp_file_count = files_left + 1
    while files_left != 1:
        first_file = temp_folder + files_to_merge[0]
        print(first_file)
        second_file = temp_folder + files_to_merge[1]
        print(second_file)
        # Process both files.
        with open(first_file, 'r', encoding='utf-8') as file_1:
            with open(second_file, 'r', encoding='utf-8') as file_2:
                # Setup
                temp_file = temp_folder + "tempFile - {:03}.csv".format(temp_file_count)
                file1_line, file2_line = file_1.readline(), file_2.readline()
                compare_values_list = [file1_line.split(','), file2_line.split(',')]
                print("Writing to >> {}...".format(temp_file))
                # Keep going until all values have been read from both files.
                with open(temp_file, 'a', encoding='utf-8') as m_file:
                    while len(compare_values_list) != 0 or (file1_line != '' or file2_line != ''):
                        # Grab the lowest value from the list, write it to the file, and delete it.
                        compare_values_list.sort(key=sorter)  # sorter = operator.itemgetter(sort_key)
                        line_to_write = ','.join(compare_values_list[0])
                        del compare_values_list[0]
                        m_file.write(line_to_write)
                        # Get the next values from the files and check whether to add them to the list.
                        file1_line, file2_line = file_1.readline(), file_2.readline()
                        if file1_line != '' and file2_line != '':
                            compare_values_list.append(file1_line.split(','))
                            compare_values_list.append(file2_line.split(','))
                        elif file1_line != '' and file2_line == '':
                            compare_values_list.append(file1_line.split(','))
                        elif file1_line == '' and file2_line != '':
                            compare_values_list.append(file2_line.split(','))
        # Clean up files and update values.
        os.remove(first_file)
        os.remove(second_file)
        temp_file_count += 1
        files_to_merge = os.listdir(temp_folder)
        files_left = len(files_to_merge)
    print("Finish merging files.")
Answer 0 (score: 1)
There are two slow parts that jump out.
First, your script opens the temporary file as it writes. Move these lines outside of the nested while loop:
with open(temp_file, 'a', encoding='utf-8') as m_file:
    m_file.write(line_to_write)
You could also consider saving the data into a variable in memory, but I'm not sure how good an idea that is if the files are large.
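A middle ground between the two is to buffer a batch of output lines and flush them together. Here is a minimal sketch of that idea, with a hypothetical write_buffered helper that the merge loop would feed its output lines into; the batch size is an arbitrary pick:

def write_buffered(lines, path, batch_size=10000):
    # Append lines to path, flushing in batches instead of one write per line.
    with open(path, 'a', encoding='utf-8') as out:
        buf = []
        for line in lines:
            buf.append(line)
            if len(buf) >= batch_size:
                out.writelines(buf)
                buf.clear()
        out.writelines(buf)  # flush whatever is left over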
Second, it's how you're using compare_values_list. You append and delete frequently, which takes a lot of work reallocating space in memory. You're also recreating the list from scratch very frequently. Start by avoiding a copy of the list on every pass through the loop; sorting it in place:
compare_values_list.sort(key=sorter)
should help you avoid that. If you want to try speeding things up further, preallocate the list and manage its size yourself. Something like:
compare_values_list_capacity = 1000
compare_values_list_size = 0
compare_values_list = [None]*compare_values_list_capacity
though I'm hazy on the details of mixing these two solutions together; I'm not sure whether the preallocation plays nicely with the sorting, so it's worth trying both and seeing what works.
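One more option worth knowing about, since the chunk files are already sorted: the standard library's heapq.merge does exactly this kind of streaming merge and avoids the list bookkeeping entirely. A minimal sketch, where the key column and the merge_two_files wrapper are assumptions to keep the example self-contained:

import heapq
import operator

sorter = operator.itemgetter(0)  # assumed sort column; substitute your own sort_key

def merge_two_files(first_file, second_file, out_file):
    # heapq.merge lazily yields the smallest remaining line across the two
    # already-sorted inputs, so nothing is appended, deleted, or re-sorted
    # in a Python list on every iteration.
    with open(first_file, encoding='utf-8') as f1, \
         open(second_file, encoding='utf-8') as f2, \
         open(out_file, 'w', encoding='utf-8') as out:
        out.writelines(heapq.merge(f1, f2, key=lambda line: sorter(line.split(','))))

Because heapq.merge yields lines lazily, memory use stays flat no matter how large the two input files are.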