I'm working with 2 large dataset files for a project. I managed to clean the files line by line, but when I try to apply the same logic to merge the 2 files based on a common column, it fails. The problem is that the second loop runs to completion before the top loop advances (I don't know why this happens). I first tried using numpy:
import numpy as np

buys = np.genfromtxt('buys_dtsep.dat', delimiter=",", dtype='str')
clicks = np.genfromtxt('clicks_dtsep.dat', delimiter=",", dtype='str')
f = open('combined.dat', 'w')

for s in clicks:
    for s2 in buys:
        # process data
But loading a file with 33 million entries into an array is not feasible, because of memory limits and the time needed to load the data into arrays before processing it. So I tried to process the files line by line to avoid running out of memory:
import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')
f = open('combined.dat', 'w')
csv_buys = csv.reader(buys)
csv_clicks = csv.reader(clicks)

for s in csv_clicks:
    print 'file 1 row x'  # to check when it loops
    for s2 in csv_buys:
        print s2[0]  # check looped data
        # do merge op
The printed output should be:
file 1 row 0
file 2 row 0
...
file 2 row x
file 1 row 1
and so on
The output I actually get is:
file 2 row 0
file 2 row 1
...
file 2 row x
file 1 row 0
...
file 1 row z
I can't merge the files line by line unless the loop problem above is solved.
Update: sample data
Buys file sample:
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
Clicks file sample:
420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0
Answer 0 (score: 1)
Edit: the OP wants to iterate over the second file repeatedly, so I have changed my answer.
You loop over each row of the first file and, for each of those rows, loop over the second file. Your inner loop only works once, because the csv_buys iterator is consumed during the first pass of the outer loop.
for s in csv_clicks:  # <--- looping over the 1st file works fine
    print 'file 1 row x'  # to check when it loops
    for s2 in csv_buys:  # <--- loops over the whole 2nd file and exhausts the iterator! this loop will ONLY work once!
        print s2[0]  # check looped data
        # do merge op
What you need to do instead is:
for s in csv_clicks:  # <--- stays the same - works fine
    print 'file 1 row x'  # to check when it loops
    for s2 in csv.reader(open('buys_dtsep.dat')):  # <--- now you start from the beginning of the file on every outer pass
        print s2[0]  # check looped data (s2 is a parsed row again, so s2[0] is the first column)
        # do merge op
Warning: the code above has O(n²) complexity.
If your script turns out to be very slow (and it will), you will have to consider a different solution, such as the sketch below.
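One alternative (not part of the original answer, just a sketch) is a simple hash join: load the smaller buys file into a dict keyed on the shared first column, then stream the 33-million-row clicks file once. The file names and column 0 as the join key come from the question; the output format and the assumption that the buys file fits in memory are mine.

import csv
from collections import defaultdict

# Build an in-memory index of the (smaller) buys file, keyed on column 0.
buys_by_key = defaultdict(list)
with open('buys_dtsep.dat') as f_buys:
    for row in csv.reader(f_buys):
        buys_by_key[row[0]].append(row)

# Stream the large clicks file once and look up matching buys in O(1) per row.
with open('clicks_dtsep.dat') as f_clicks, open('combined.dat', 'w') as f_out:
    writer = csv.writer(f_out)
    for click in csv.reader(f_clicks):
        for buy in buys_by_key.get(click[0], []):
            # do merge op; here we simply write the click row followed by the matching buy row
            writer.writerow(click)
            writer.writerow(buy)

This reads each file exactly once, so the cost is roughly linear in the total number of rows instead of quadratic.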
Answer 1 (score: 1)
The approach below should hopefully help. It is designed to speed things up and reduce your memory requirements:
import csv
from heapq import merge
from itertools import groupby, ifilter

def get_click_entries(key):
    with open('clicks.csv', 'rb') as f_clicks:
        for entry in ifilter(lambda x: int(x[0]) == key, csv.reader(f_clicks)):
            entry.insert(4, '')  # add empty missing column
            yield entry

# First create a set holding all column 0 click entries
with open('clicks.csv', 'rb') as f_clicks:
    csv_clicks = csv.reader(f_clicks)
    click_keys = {int(cols[0]) for cols in csv_clicks}

with open('buys.csv', 'rb') as f_buys, \
     open('clicks.csv', 'rb') as f_clicks, \
     open('merged.csv', 'wb') as f_merged:

    csv_buys = csv.reader(f_buys)
    csv_clicks = csv.reader(f_clicks)
    csv_merged = csv.writer(f_merged)

    for k, g in groupby(csv_buys, key=lambda x: int(x[0])):
        if k in click_keys:
            buys = sorted(g, key=lambda x: (x[1], x[2]))
            clicks = sorted(get_click_entries(k), key=lambda x: (x[1], x[2]))
            csv_merged.writerows(merge(buys, clicks))  # merge the two lists based on the timestamp
            click_keys.remove(k)
        else:
            csv_merged.writerows(g)

    # Write any remaining click entries
    for k in click_keys:
        csv_merged.writerows(get_click_entries(k))
For the two sample files, this produces the following output:
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
420374,2014-04-06,18:41:50,214537888,,0
420374,2014-04-06,18:42:33,214537850,,0
420374,2014-04-06,18:42:38,214537850,,0
420374,2014-04-06,18:43:02,214537888,,0
420374,2014-04-06,18:43:10,214537888,,0
420374,2014-04-06,18:44:58,214537888,,0
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
420369,2014-04-07,19:39:43,214839373,,0
420369,2014-04-07,19:39:56,214684513,,0
It first creates a set of all the column-0 keys from the clicks file, which means a key that is known not to be present can be skipped without re-reading the whole clicks file. It then reads in a group of matching column-0 entries from buys and the corresponding list of column-0 entries from clicks. These are sorted by timestamp and merged together in order. The key is then removed from the set so its click entries are not re-read at the end.
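One caveat worth spelling out: itertools.groupby only groups consecutive rows with the same key, so this approach assumes the buys file already has its column-0 values grouped together, as the sample data does. A minimal sketch of that behaviour (the row values here are invented for illustration):

from itertools import groupby

rows = [['420374', 'a'], ['420374', 'b'], ['281626', 'c'], ['420374', 'd']]

for k, g in groupby(rows, key=lambda x: int(x[0])):
    print k, list(g)

# 420374 [['420374', 'a'], ['420374', 'b']]
# 281626 [['281626', 'c']]
# 420374 [['420374', 'd']]   <-- a non-consecutive repeat of 420374 forms a separate group

If the buys file were not grouped by its first column, it would have to be sorted (or indexed some other way) before this merge would produce correct results.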
Answer 2 (score: 0)
I have replaced the files with StringIO objects; the code looks the same when using real file objects (a sketch with the real files follows the output below).
import StringIO

file1 = StringIO.StringIO("""420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1""")

file2 = StringIO.StringIO("""420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0""")

outfile = StringIO.StringIO()

data1_iter, skip_1 = iter(file1), False
data2_iter, skip_2 = iter(file2), False

while True:
    out = []
    if not skip_1:
        try:
            out.append(next(data1_iter).split()[0])
        except StopIteration:
            skip_1 = True
    if not skip_2:
        try:
            out.append(next(data2_iter).split()[0])
        except StopIteration:
            skip_2 = True
    if out:  # avoid writing a blank line once both iterators are exhausted
        outfile.write('\n'.join(out) + "\n")
    if skip_1 and skip_2:
        break

print(outfile.getvalue())
Output:
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:44:58.325,214537850,10471,1
420374,2014-04-06,18:41:50,214537888,0
281626,2014-04-06,09:40:13.032,214535653,1883,1
420374,2014-04-06,18:42:33,214537850,0
420368,2014-04-04,06:13:28.848,214530572,6073,1
420374,2014-04-06,18:42:38,214537850,0
420368,2014-04-04,06:13:28.858,214835025,2617,1
420374,2014-04-06,18:43:02,214537888,0
140806,2014-04-07,09:22:28.132,214668193,523,1
420374,2014-04-06,18:43:10,214537888,0
140806,2014-04-07,09:22:28.176,214587399,1046,1
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0
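For reference, a minimal sketch of the same interleaving loop applied to the real files named in the question instead of StringIO objects. The file names are assumptions taken from the question; rstrip('\n') replaces split()[0], which only worked above because the sample rows contain no spaces.

# Same alternating read loop, streaming the actual files and writing combined.dat.
with open('buys_dtsep.dat') as file1, \
     open('clicks_dtsep.dat') as file2, \
     open('combined.dat', 'w') as outfile:

    data1_iter, skip_1 = iter(file1), False
    data2_iter, skip_2 = iter(file2), False

    while True:
        out = []
        if not skip_1:
            try:
                out.append(next(data1_iter).rstrip('\n'))
            except StopIteration:
                skip_1 = True
        if not skip_2:
            try:
                out.append(next(data2_iter).rstrip('\n'))
            except StopIteration:
                skip_2 = True
        if out:
            outfile.write('\n'.join(out) + "\n")
        if skip_1 and skip_2:
            break

Note that this only interleaves lines in their original order; it does not join rows on the common column, so it is a building block rather than a full merge.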