Aggregating a large (GB-sized) data file in Python: grouping values by the first column

Question

My data (several GB in size) looks like this:

1 11
2 44
1 66
3 88
1 77
2 55

I want the result to look like this:

1 11 66 77
2 44 55
3 88

I tried using a dictionary, but it tops out at about 80k rows per minute, and I have 16 million rows...

Is there a better solution?

My code:

data=open(file)
d={};seen=[]
for line in data:
    if line[0] not in seen:
         d[line[0]]=line[1]
         seen.append(line[0])
    else:
         d[line[0]].append(line[1])
pickle.dump(d,file.name)

Answer 1

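The handy defaultdict tool helps here. It behaves like a dictionary, but when each value is a list you can append to it directly: the list is created automatically the first time a key is used.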

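Print the values grouped by the first column:

import collections

# Sample input; for real data, iterate over the open file object instead.
lines = '''1 11
2 44
1 66
3 88
1 77
2 55'''.split('\n')

# A missing key is initialized with an empty list automatically.
data = collections.defaultdict(list)
for line in lines:
    id_, value = line.split()
    data[id_].append(value)

for key, values in data.items():
    print(key, ' '.join(values))

Output

1 11 66 77
2 44 55
3 88

Note that the output is not sorted by key. On Python 3.7+ the groups appear in the order in which each key is first seen in the input; older versions (including Python 2, where the loop would use iteritems()) give an arbitrary order. If the input is large, sorting requires additional memory. If you need sorted output, change the loop to for key, values in sorted(data.items()):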
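For a multi-gigabyte file, the same idea can be applied while streaming the input line by line rather than loading the raw text first. The following is only a sketch of that approach, not the answerer's code: the helper names group_by_first_column and write_groups are hypothetical, and it assumes that every non-blank line holds exactly two whitespace-separated fields, that the keys are integers, and that the grouped values fit in memory.

```python
import collections
import io


def group_by_first_column(lines):
    """Group the second field of each line by the first field.

    `lines` can be any iterable of strings, such as an open file object.
    (Hypothetical helper, introduced here for illustration.)
    """
    groups = collections.defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, value = fields  # assumption: exactly two fields per line
        groups[key].append(value)
    return groups


def write_groups(groups, out):
    """Write one 'key v1 v2 ...' line per group, keys in numeric order."""
    for key in sorted(groups, key=int):  # assumption: keys are integers
        out.write(key + ' ' + ' '.join(groups[key]) + '\n')


if __name__ == '__main__':
    # In-memory stand-in for a real file, using the sample data from above.
    sample = io.StringIO('1 11\n2 44\n1 66\n3 88\n1 77\n2 55\n')
    result = io.StringIO()
    write_groups(group_by_first_column(sample), result)
    print(result.getvalue(), end='')
```

With a real file you would pass the open file object straight to group_by_first_column, since iterating over a file yields one line at a time. Sorting with key=int is a deliberate choice here: sorting the string keys directly would order them lexicographically, so '10' would come before '2'.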

Answer 2

Have you considered pandas? I think pandas can read the file faster and do the computation much faster than plain Python. The following may be useful...

import pandas as pd
# Assumes a comma-separated file; for whitespace-separated data, add sep=r'\s+'.
temp = pd.read_csv('file.csv', header=None, names=['a', 'b'])
print(temp.groupby('a').agg(lambda xs: ' '.join(map(str, xs))).reset_index())

This prints the following:

   a         b
0  1  11 66 77
1  2     44 55
2  3        88

Give it a try...
