I have built a list of .txt file paths from a directory in Python, and then written a function to combine those files.
import pandas as pd

def combine_directory_txt(FilePaths):
    """
    This function will combine all files in a directory by importing each,
    and appending them to a single output. It only works for CSVs (.txt) with
    a delimiter of "|".
    """
    Output = pd.DataFrame()  # Dataframe which will store the final table
    Increment = 0
    Total = len(FilePaths)
    # Import each file and join them together
    for file in FilePaths:
        Increment += 1
        Import = pd.read_csv(file, sep='|', error_bad_lines=False,
                             low_memory=False, encoding='mbcs')
        Output = Output.append(Import)
        print(Increment, " of ", Total, " joined")
        del Import
    return Output
This works fine until my PC runs into MemoryErrors. Is there a more efficient way to do this? I realise I have used low_memory=False; the process is repeated monthly, so I don't know in advance what the columns will look like, and my code kept failing early because of all the dtype warnings. Is this the right approach? Should I write code to figure out what the dtypes are and then assign them, to reduce memory?
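For example, I imagine explicitly pinning the dtypes would look something like this (the column names and types below are only placeholders, not my real schema):

import pandas as pd

# Hypothetical schema: replace with the real column names and types
dtypes = {"id": "int32", "category_code": "category", "amount": "float32"}
sample = pd.read_csv("mbcs_1.txt", sep='|', encoding='mbcs', dtype=dtypes)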
Answer 0 (score: 1)
Your approach reads every CSV file into memory, combines them all and returns the resulting dataframe. Instead, you should process one CSV file at a time, writing the results to an output.csv file as you go.
The script below shows how this can be done. It adds a parameter for the output filename. It assumes that all files in a run share the same format and that each has the same header. The header is written to the output CSV file once, and then skipped while reading each input file.
import csv

def combine_directory_txt(file_paths, output_filename):
    # Get the header from the first CSV file passed
    with open(file_paths[0], "r", newline="") as f_input:
        header = next(csv.reader(f_input, delimiter="|"))

    with open(output_filename, "w", newline="") as f_output:
        csv_output = csv.writer(f_output, delimiter="|")
        csv_output.writerow(header)    # Write the header once

        for file_name in file_paths:
            with open(file_name, "r", newline="") as f_input:
                csv_input = csv.reader(f_input, delimiter="|")
                next(csv_input)        # Skip the header in each input file
                csv_output.writerows(csv_input)

combine_directory_txt(["mbcs_1.txt", "mbcs_2.txt"], "output.csv")
With this approach, the memory requirements are greatly reduced.
Answer 1 (score: 0)
The main idea is to read the data in chunks (a fixed number of rows) by passing the chunksize parameter to read_csv, and to append each chunk to the output file. The same parameter can optionally be passed to to_csv as well. Although I haven't profiled this code, in general reading and writing in chunks improves IO performance, especially for large files.
import pandas as pd

def combine_directory_txt(file_paths, output_filename, chunksize):
    """Merge a collection of files.

    :param file_paths: Collection of paths of files to merge.
    :param output_filename: Path of output file (i.e., merged file).
    :param chunksize: Number of lines to read in at one time.
    """
    with open(output_filename, "w", newline="") as outfile:
        chunk_transfer(file_paths[0], outfile, chunksize, append=False)
        for path in file_paths[1:]:
            chunk_transfer(path, outfile, chunksize, append=True)

def chunk_transfer(path, outfile, chunksize, append, include_index=False):
    """Transfer file at path to outfile in chunks.

    :param path: Path of file to transfer.
    :param outfile: File handle for output file.
    :param chunksize: Number of lines to read at a time.
    :param append: Whether we are appending to an already started output file.
    :param include_index: Whether to include index of dataframe.
    """
    include_header = not append   # only the first file contributes a header
    # read_csv with chunksize returns an iterator of DataFrames
    chunks = pd.read_csv(path,
                         sep='|',
                         error_bad_lines=False,
                         # low_memory=False,
                         encoding='mbcs',
                         chunksize=chunksize)
    # chunksize can also be passed to to_csv to buffer the writes
    for chunk in chunks:
        chunk.to_csv(outfile, sep='|', header=include_header, index=include_index)
        include_header = False     # write the header at most once
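It is called the same way as above, plus a chunk size (the file names and chunk size here are only examples):

# Example invocation; file names and chunk size are placeholders
combine_directory_txt(["mbcs_1.txt", "mbcs_2.txt"], "output.csv", chunksize=100000)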