For one of my data analysis pipelines I end up generating a lot of individual CSV files. I'd like to transpose them, concatenate them, and transpose them again. However, the data is large, so loading it all into memory is not practical.
Answer 0 (score: 1)
Concatenating the rows of data from two csv files (if that's what you mean) without loading all of them into memory is a relatively easy and fast operation: just read a single row from each one, join those together, and then write that to an output file, repeating until all the input data has been exhausted.
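A minimal Python 3 sketch of that row-by-row join (the function name and file paths are illustrative, not from the answer):

```python
import csv

def concat_rows(path_a, path_b, out_path):
    """Join each row of path_a with the matching row of path_b,
    streaming one row at a time so memory use stays constant."""
    with open(path_a, newline='') as fa, \
         open(path_b, newline='') as fb, \
         open(out_path, 'w', newline='') as fo:
        writer = csv.writer(fo)
        # zip stops at the shorter file, so no row is ever half-joined
        for row_a, row_b in zip(csv.reader(fa), csv.reader(fb)):
            writer.writerow(row_a + row_b)
```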
Transposing the data in a csv file without reading the whole thing into memory is intrinsically a slower process, since it requires re-reading the entire input file in multiple passes, each time extracting the data from just one of the columns it contains. If that's an acceptable (or necessary) trade-off, here's basically how it can be done using the built-in csv module:
```python
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'wb') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, 'rb') as inputf:
        # determine number of columns in input file by counting those in its first row
        # number of cols in input file determines number of rows in output file
        numcols = len(csv.reader(inputf).next())
        # read entire input file multiple times, extracting one column from each row
        for col_index in xrange(numcols):
            # write all of column data as a single row of the output file
            inputf.seek(0)  # rewind file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))
```
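For reference, the code above is Python 2 (`xrange`, `.next()`, binary file modes). A roughly equivalent Python 3 sketch of the same multi-pass approach:

```python
import csv

def transpose_csv(input_filename, output_filename):
    """Constant-memory transpose: re-read the input once per column,
    writing each column out as a row of the output."""
    with open(output_filename, 'w', newline='') as outputf:
        writer = csv.writer(outputf)
        with open(input_filename, newline='') as inputf:
            # number of columns in the input = number of rows in the output
            numcols = len(next(csv.reader(inputf)))
            for col_index in range(numcols):
                inputf.seek(0)  # rewind the file for each pass
                writer.writerow([row[col_index] for row in csv.reader(inputf)])
```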
Answer 1 (score: 0)
The following code simulates reading from two csv files. The first one has the two rows
[1,2,1]
[3,4,1]
and the second one
[7,8,2]
[9,10,2]
The result is the two rows
[1,2,1,7,8,2]
[3,4,1,9,10,2]
Is that what you want?
```python
def source1():
    for i in [[1, 2, 1], [3, 4, 1]]:
        yield i

def source2():
    for i in [[7, 8, 2], [9, 10, 2]]:
        yield i

def join(*sources):
    while True:
        row = []
        for s in sources:
            row.extend(s.next())
        yield row

for row in join(source1(), source2()):
    print row
```
In your case you'd replace the calls to source1() and source2() with csv file iterators.
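Under Python 3, `s.next()` becomes `next(s)`, and because of PEP 479 a bare `StopIteration` no longer silently ends the generator, so it has to be caught explicitly. A sketch of the same `join` wired to csv readers (function names and paths are illustrative):

```python
import csv

def join(*sources):
    # yield the concatenation of one row from each source per iteration;
    # stop cleanly as soon as any source runs out (PEP 479 requires the except)
    while True:
        row = []
        for s in sources:
            try:
                row.extend(next(s))
            except StopIteration:
                return
        yield row

def join_csv_files(paths, out_path):
    files = [open(p, newline='') for p in paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(out_path, 'w', newline='') as out:
            writer = csv.writer(out)
            for row in join(*readers):
                writer.writerow(row)
    finally:
        for f in files:
            f.close()
```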
Answer 2 (score: 0)
Here is a solution that works when the fields have a fixed width:
```python
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'
    with open(path_in) as fd_in:
        line = fd_in.readline()
        l = line.split()
        field_width = int(len(line) / len(l))
        file_size = os.path.getsize(path_in)
        cols2 = rows1 = line_count = int(file_size / len(line))
        rows2 = cols1 = len(l)
    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row in range(rows2):
            for col in range(cols2 - 1):
                fd_in.seek(col * len(line) + row * field_width)
                fd_out.write('{} '.format(fd_in.read(field_width - 1)))
            fd_in.seek((col + 1) * len(line) + row * field_width)
            fd_out.write('{}\n'.format(fd_in.read(field_width - 1)))
    return

if __name__ == '__main__':
    main()
```
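The fixed-width trick above boils down to one piece of arithmetic: cell (row, col) of the transposed output is cell (col, row) of the input, which starts at byte offset `col * line_len + row * field_width`. A minimal sketch of just that lookup (a hypothetical helper, assuming single-byte ASCII fields, a one-character separator, and Unix `'\n'` line endings):

```python
def read_transposed_cell(path, row, col, line_len, field_width):
    """Return cell (row, col) of the transposed matrix by seeking
    directly to cell (col, row) of the original file on disk."""
    with open(path, 'rb') as fd:  # binary mode: seek offsets are plain bytes
        fd.seek(col * line_len + row * field_width)
        return fd.read(field_width - 1).decode('ascii')  # drop the trailing separator
```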
And here is a solution that works when the fields do not have a fixed width:
```python
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'
    separator = ' '
    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i
    with open(path_in) as fd_in:
        line = fd_in.readline()
        rows2 = cols1 = len(line.split(separator))
        del line
    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                fd_in.seek(d_seek[row1])
                j = 0
                s = ''
                while True:
                    char = fd_in.read(1)
                    j += 1
                    if char == separator or char == '\n':
                        break
                    s += char
                d_seek[row1] += len(s) + 1
                if row1 + 1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))
    return

if __name__ == '__main__':
    main()
```
Answer 3 (score: 0)
Another short and pythonic solution. I used it to transpose CSVs that are 15,000,000 x 12,000. It's fast and pure python. Everything else you need to do is trivial, and this is definitely the hardest part.
Github link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faa503a
```python
def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec
    :param csv_path:
    :param output_csv_path:
    :param delimiter:
    :return:
    """
    import csv
    transposed_iterator = zip(*csv.reader(open(csv_path)))
    with open(output_csv_path, 'w') as out:
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')
```
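One caveat worth noting (not stated in the answer): because `zip(*...)` has to unpack every row of the reader as an argument, the entire input is consumed before the first transposed row is produced, so this approach is simple and fast but not truly out-of-core. A small self-contained demo of the technique on in-memory data:

```python
import csv
import io

# zip(*reader) pairs up the i-th field of every row, i.e. it yields the columns
data = "1,2,3\n4,5,6\n"
transposed = [list(col) for col in zip(*csv.reader(io.StringIO(data)))]
# transposed is now [['1', '4'], ['2', '5'], ['3', '6']]
```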
Answer 4 (score: -1)
Use generators, for example:
```python
from itertools import izip

file1 = open("test", "r")
file2 = open("test2", "r")

def lazy(file):
    for line in file:
        # do something with the line
        yield line

for lines in izip(lazy(file1), lazy(file2)):
    print lines
```
http://wiki.python.org/moin/Generators
Edit: you could use the csv module to parse it; I also realized that the readlines() method of file objects is not lazy, so you have to use the `for line in file` pattern instead.
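In Python 3, `itertools.izip` is gone because the built-in `zip` is already lazy; combined with the csv module, the same pattern might look like this sketch (function name and paths are illustrative):

```python
import csv

def lazy_join(path1, path2):
    """Yield the rows of two CSV files pairwise, reading lazily
    one row at a time from each file."""
    with open(path1, newline='') as f1, open(path2, newline='') as f2:
        for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
            yield row1 + row2
```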