Question

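My program first clusters a large dataset into 100 clusters, then uses multiprocessing to run a model on each cluster. My goal is to end up with one big csv file that is the concatenation of the output data from all 100 fitted models.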

For now, I just create 100 csv files, then loop over the folder containing them and copy them one by one, line by line, into one big file.

My question: is there a smarter way to get this big output file without exporting 100 files? I use pandas and scikit-learn for data processing, and multiprocessing for parallelization.

Answer 1

If none of your partial csv files have a header and they all share the same number and order of columns, you can simply concatenate them:

# partial_csv_names is assumed to be a list of paths to the partial csv files
with open("unified.csv", "w") as unified_csv_file:
    for partial_csv_name in partial_csv_names:
        with open(partial_csv_name) as partial_csv_file:
            unified_csv_file.write(partial_csv_file.read())
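
If the partial files do carry a header row, a variant along the following lines keeps the header from the first file and skips it in the others. This is only a sketch: the function name `concat_csv_with_header` is made up for illustration, and it assumes every file starts with exactly one header line and has the same columns. It also adds a newline after a last row that lacks one, so rows from consecutive files do not run together.

```python
def concat_csv_with_header(partial_csv_names, unified_csv_name):
    """Concatenate CSV files that all start with the same one-line header."""

    def with_newline(line):
        # Make sure a row ends with a newline before the next file starts
        return line if line.endswith("\n") else line + "\n"

    with open(unified_csv_name, "w") as unified_csv_file:
        for index, partial_csv_name in enumerate(partial_csv_names):
            with open(partial_csv_name) as partial_csv_file:
                # The first line of every file is assumed to be the header
                header = partial_csv_file.readline()
                if index == 0 and header:
                    # Keep the header only once, from the first file
                    unified_csv_file.write(with_newline(header))
                for line in partial_csv_file:
                    unified_csv_file.write(with_newline(line))
```

It would be called as `concat_csv_with_header(partial_csv_names, "unified.csv")`.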

Answer 2

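Have your worker processes return their datasets to the main process instead of writing csv files themselves; then, as they hand the data back, have the main process write it all into one continuous csv.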

from multiprocessing import Process, Manager

def worker_func(proc_id, results):
    # Do your thing, then store this worker's dataset under its id
    results[proc_id] = ["your dataset from %s" % proc_id]

def convert_dataset_to_csv(dataset):
    # Placeholder example.  I realize what it's doing is ridiculous
    converted_dataset = [','.join(data.split()) for data in dataset]
    return converted_dataset

# The guard is needed on platforms that spawn (rather than fork) new processes
if __name__ == '__main__':
    m = Manager()
    d_results = m.dict()

    worker_count = 100

    jobs = [Process(target=worker_func,
                    args=(proc_id, d_results))
            for proc_id in range(worker_count)]

    for j in jobs:
        j.start()

    # Wait for every worker to finish before writing anything
    for j in jobs:
        j.join()

    with open('somecsv.csv', 'w') as f:
        for d in d_results.values():
            # if the actual conversion function benefits from multiprocessing,
            # you can do that there too instead of here
            for r in convert_dataset_to_csv(d):
                f.write(r + '\n')
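
As an alternative sketch of the same idea, `multiprocessing.Pool.imap` hands each worker's result back to the main process in input order as it becomes available, so the main process can write to the single file while other workers are still running, with no shared dict and no intermediate files. The names `fit_cluster`, `write_results` and `run`, the column names, and the dummy rows are placeholders, not part of the answer above; a real worker would fit the model on one cluster and return its output rows.

```python
import csv
from multiprocessing import Pool


def fit_cluster(cluster_id):
    # Hypothetical worker: a real one would fit the model on one cluster.
    # Here it just returns two dummy rows for that cluster.
    return [[cluster_id, i, cluster_id * 10 + i] for i in range(2)]


def write_results(results, csv_path, header=None):
    # Write every row of every result to one CSV file, in iteration order.
    # Returns the number of data rows written (the header is not counted).
    row_count = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        for rows in results:
            writer.writerows(rows)
            row_count += len(rows)
    return row_count


def run(cluster_count, csv_path, processes=None):
    # The pool must stay open while the results are being consumed,
    # so the writing happens inside the "with" block.
    with Pool(processes) as pool:
        results = pool.imap(fit_cluster, range(cluster_count))
        return write_results(results, csv_path,
                             header=["cluster", "row", "value"])
```

Calling `run(100, "unified.csv")` from under an `if __name__ == "__main__":` guard would produce the single file directly. With pandas, the worker could presumably return a DataFrame instead, and the main process could call `to_csv(f, header=False, index=False)` on each one.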

Answer 3

I pinched the guts of this from http://computer-programming-forum.com/56-python/b7650ebd401d958c.htm; it's a gem.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from glob import glob

file_list = glob('/home/rolf/*.csv')
print("There are {x} files to be concatenated".format(x=len(file_list)))
with open('concatenated.csv', 'w') as concat_file:
    # Append each partial file to the output, counting from 1
    for n, name in enumerate(file_list, 1):
        print("files added {n}".format(n=n))
        with open(name, 'r') as f:
            concat_file.write(f.read())
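
For large partial files, reading each one fully into memory can be avoided. The following is a sketch using `shutil.copyfileobj`, which copies in chunks. The function name `concat_files`, the sorting, and the check that skips the output file when it matches the glob pattern are additions for illustration, not part of the snippet above.

```python
import os
import shutil
from glob import glob


def concat_files(pattern, output_path):
    # Sort for a reproducible order, and skip the output file itself
    # in case it lives in the same directory as the inputs.
    output_abs = os.path.abspath(output_path)
    file_list = [name for name in sorted(glob(pattern))
                 if os.path.abspath(name) != output_abs]
    with open(output_path, "wb") as concat_file:
        for name in file_list:
            with open(name, "rb") as partial_file:
                # Copies chunk by chunk instead of loading the whole file
                shutil.copyfileobj(partial_file, concat_file)
    # Number of input files that were concatenated
    return len(file_list)
```

Like the snippet above, this assumes the files have no header rows and each ends with a newline.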

Concatenate csv files nicely with python
