Question

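I have about 50 large datasets, each with roughly 200K-500K columns, and I am trying to come up with an efficient way to merge/concatenate them. What is the fastest way to perform a conditional column-wise concatenation (merge) of these files?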

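Currently I have the code listed below, but it takes hours (at least 12) to get through my datasets. Bearing in mind that these input files (datasets) will be very large, is there a way to tweak this code to use as little memory as possible? One clue I came up with (from looking at the code below) is to close each file after opening it, but I don't know how to do that.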

Note that:
a.  All files have the same number of rows
b.  The first two columns are the same throughout the files
c.  All files are tab delimited
d.  This code works but it is ridiculously slow!

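The code shown below works for the sample datasets. Like my large datasets, the sample datasets below share the same first two columns. I'd appreciate any feedback or suggestions on how to make the code run efficiently, or alternative methods that do the job efficiently.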

Input 1: test_c1_k2_txt.gz :-
c1  c2  1.8 1.9 1.7
L1  P   0.5 1.4 1.1
L2  P   0.4 1.8 1.2
L3  P   0.1 1.9 1.3

Input 2: test_c1_k4_txt.gz :-
c1  c2  0.1 0.9 1.1 1.2
L1  P   1.8 1.7 1.8 2.8
L2  P   1.3 1.4 1.2 1.1
L3  P   1.7 1.6 1.5 1.4

Input 3: test_c3_k1_txt.gz :-
c1  c2  1.3 1.4
L1  P   1.1 2.9
L2  P   2.2 1.4
L3  P   1.7 1.6

Output : - test_all_c_all_k_concatenated.txt.gz :-
c1  c2  1.8 1.9 1.7 0.1 0.9 1.1 1.2 1.3 1.4
L1  P   0.5 1.4 1.1 1.8 1.7 1.8 2.8 1.1 2.9
L2  P   0.4 1.8 1.2 1.3 1.4 1.2 1.1 2.2 1.4
L3  P   0.1 1.9 1.3 1.7 1.6 1.5 1.4 1.7 1.6

Python code for merging/concatenation

import os, glob, sys, gzip, time


start_time = time.time()

max_c = 3
max_k = 4

filearr = []

# Loop through the files, ordered by "c" first and then by "k", and build a list of open files
for c in range(1, max_c + 1):
    for k in range(1, max_k + 1):
        # Build the file name
        fname = "test_c" + str(c) + "_k" + str(k) + "_txt.gz"
        # If the named file exists, ..
        if os.path.exists(fname):
            print("Input file " + fname + " exists ... ")
            # Open the file in text mode and add it to the list
            files = [gzip.open(f, 'rt') for f in glob.glob(fname)]
            filearr = filearr + files

# Initialize a list to append columns to
d = []
for file in filearr:
    # Strip and split each line of each file
    row_list = [line.rstrip().split('\t') for line in file.readlines()]
    # Transpose the rows so that each entry is one column of the file
    row_list_t = [[r[col] for r in row_list] for col in range(len(row_list[0]))]
    # Combine the transposed columns from each file into one list
    d = d + row_list_t

# Initialize an empty list
temp = []
for i in d:
    # Append each column only if it has not been seen yet
    if i not in temp:
        temp.append(i)
# Transpose the kept columns back into rows
appended = [[r[col] for r in temp] for col in range(len(temp[0]))]

# Write output dataset into a tab delimited file
outfile = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
for i in appended:
    for j in i[:-1]:
        outfile.write(j + '\t')
    outfile.write(i[-1] + '\n')
outfile.close()
print('executed prob file concatenation successfully. ')

total_time = time.time() - start_time
print("Total time it took to finish: ", total_time)

Answer 1

Your code is hard to read; however, I can see two O(N^2) operations here.

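The first is where you perform d = d + row_list_t inside the loop. That operation creates a new list every time, so it is O(N), which makes the loop O(N^2). Switch to an in-place method such as extend to improve this.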
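
A minimal sketch of that change, using made-up column data rather than the question's files (the name chunks is hypothetical). Note that list.append would add each batch as a single nested element, so list.extend (or +=) is the in-place equivalent of the concatenation here:

```python
# Hypothetical stand-in for the per-file column lists (row_list_t in the question).
chunks = [[["c1", "L1"], ["c2", "P"]], [["1.8", "0.5"]], [["0.1", "1.8"]]]

# Quadratic: every pass builds a brand-new list holding all earlier columns.
d_slow = []
for chunk in chunks:
    d_slow = d_slow + chunk

# Linear: extend adds the new columns to the existing list in place.
d_fast = []
for chunk in chunks:
    d_fast.extend(chunk)
```

Both loops end with the same list; only the amount of copying differs.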

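The second is where you perform if i not in temp:. Searching a list is O(N), which makes your loop O(N^2). Use a set for the membership check to fix this. The extra O(N) memory it requires is nothing compared with what you are already using, and it is well worth the speedup.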
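
One way this could look, as a sketch under the assumption that each column is a list of strings (unique_columns is a hypothetical helper name, not something from the question). Lists cannot be stored in a set, so the sketch hashes a tuple copy of each column:

```python
def unique_columns(columns):
    """Return columns in first-seen order, dropping exact repeats."""
    seen = set()   # tuple copies of the columns already kept
    kept = []
    for col in columns:
        key = tuple(col)  # lists are unhashable, so hash a tuple copy
        if key not in seen:
            seen.add(key)
            kept.append(col)
    return kept

cols = [["c1", "L1", "L2"], ["c2", "P", "P"], ["1.8", "0.5", "0.4"],
        ["c1", "L1", "L2"], ["c2", "P", "P"], ["0.1", "1.8", "1.3"]]
result = unique_columns(cols)
```

Like the original not-in test, this also drops any data column that happens to be identical to an earlier one, so it is only safe if that cannot occur or does not matter.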

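This may not solve all of your problems, however; there may be more. The best thing you can do is import time at the start of the program and then call print(time.time()) before each section of it. That will show you which parts run slower than others, and you can then work out how to fix them.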
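
A small sketch of that kind of timing, wrapped in a hypothetical helper named timed so each stage reports its own duration. It uses time.perf_counter, which is better suited to measuring intervals than time.time; the transpose stage is only a stand-in for a real section of the program:

```python
import time

def timed(label, func, *args):
    """Run func(*args), print how long it took, and return its result."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s took %.3f s" % (label, elapsed))
    return result

# Hypothetical stage: transpose a small table of rows into columns.
rows = [["c1", "c2", "1.8"], ["L1", "P", "0.5"]]
columns = timed("transpose", lambda r: [list(c) for c in zip(*r)], rows)
```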

Answer 2

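The following code is an efficient way to handle the data merge. It opens all the files. Then, for each record, it copies the line from the first data file, which holds the two header columns plus that file's values. Next, for each input file other than the first, it reads one line, zaps the first two header columns, and writes the rest to the output dataset. Each input file's values are separated from the others.

Have fun!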

#!/usr/bin/env python

import glob, gzip, re

# open every input in text mode, in sorted filename order
data_files = [ gzip.open(name, 'rt') for name in sorted(
    glob.glob('*_txt.gz')
) ]

# we'll use the two header columns from the first file
firstf = data_files.pop(0)

outf = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
for recnum,fline in enumerate( firstf ):

    print('record', recnum+1)

    # output header columns plus first batch of data
    outf.write( fline.rstrip() )

    # for each input, read one line of data, write values
    for dataf in data_files:
        # read line with headers and values
        line = next(dataf)

        # zap two header columns
        line = re.sub(r'^\S+\s+\S+\s+', '', line)

        # separate this file's values from the previous file's with a tab
        outf.write( '\t' )

        outf.write( line.rstrip() )

    # finish the line of data
    outf.write( '\n' )

# close the output and every input
outf.close()
firstf.close()
for dataf in data_files:
    dataf.close()
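
The same row-at-a-time idea can also be sketched with contextlib.ExitStack, which closes every opened file on exit and so speaks to the question's point about closing files. This is a variant under stated assumptions, not the answer's own code: merge_columns and key_cols are hypothetical names, the inputs are assumed to be tab delimited with the same number of rows, and zip stops at the shortest file if they are not.

```python
import gzip
from contextlib import ExitStack

def merge_columns(in_names, out_name, key_cols=2):
    """Join tab-delimited gzip files side by side, one row at a time."""
    with ExitStack() as stack:
        # every file registered here is closed when the block exits
        inputs = [stack.enter_context(gzip.open(n, "rt")) for n in in_names]
        out = stack.enter_context(gzip.open(out_name, "wt"))
        # zip pulls one line from each input, so only one row is held in memory
        for lines in zip(*inputs):
            fields = lines[0].rstrip("\n").split("\t")
            for line in lines[1:]:
                # drop the shared key columns from every file after the first
                fields.extend(line.rstrip("\n").split("\t")[key_cols:])
            out.write("\t".join(fields) + "\n")
```

It could be called as merge_columns(sorted(glob.glob('*_txt.gz')), 'all_c_all_k_concatenated.txt.gz') after importing glob. Splitting on the tab character avoids the regular expression and keeps the output uniformly tab delimited.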

How to perform an efficient concatenation (merge) in Python
