I have a large text file (more than 10 GB) of data that looks like this:
id name info group count
1 a1 aa1 g1 3
1 a1 aa1 g2 6
1 a1 aa1 g3 1
2 a2 aa2 g1 5
2 a2 aa2 g2 18
3 a3 aa3 g2 7
3 a3 aa3 g4 2
I want to produce a new file like this:
id name info g1 g2 g3 g4
1 a1 aa1 3 6 1 0
2 a2 aa2 5 18 0 0
3 a3 aa3 0 7 0 2
My data also has more than 100 possible groups, and I don't know the exact number.
Any ideas on how to solve this?
Answer 0: (score: -2)
In [2]: f = open('t.txt')
# first pass: determine group names
In [3]: header = next(f)
In [4]: groups = dict()
In [5]: for line in f:
   ...:     tokens = line.split()
   ...:     groups[tokens[3]] = 0
   ...:
In [6]: groups
Out[6]: {'g4': 0, 'g3': 0, 'g2': 0, 'g1': 0}
# second pass: stream through data and print
In [7]: f.seek(0)
In [8]: next(f)
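In [9]: current = None  # [id, name, info] of the rows currently being summed
   ...: print('id', 'name', 'info', *sorted(groups))
   ...: for line in f:
   ...:     tokens = line.split()
   ...:     if tokens[:3] != current:
   ...:         if current is not None:
   ...:             # the id changed: emit the finished row, then reset the counters
   ...:             print(*current, *(groups[g] for g in sorted(groups)))
   ...:             groups = dict.fromkeys(groups, 0)
   ...:         current = tokens[:3]
   ...:     groups[tokens[3]] += int(tokens[4])
   ...: print(*current, *(groups[g] for g in sorted(groups)))  # emit the last id
   ...:
id name info g1 g2 g3 g4
1 a1 aa1 3 6 1 0
2 a2 aa2 5 18 0 0
3 a3 aa3 0 7 0 2
This relies on all rows for a given id being adjacent in the file, as they are in the sample.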
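For use outside an interactive session, here is a minimal sketch of the same two-pass idea as a function that writes to a file. The name `pivot_groups` and its parameters are hypothetical, and it assumes whitespace-separated fields, a single header line, and that all rows sharing an id are adjacent in the input. Memory use grows with the number of distinct groups, not with the number of rows.

```python
def pivot_groups(in_path, out_path):
    """Pivot 'id name info group count' rows into one row per id.

    Assumption: rows with the same (id, name, info) are contiguous.
    """
    # Pass 1: collect the distinct group names.
    groups = set()
    with open(in_path) as src:
        next(src, None)  # skip the header line
        for line in src:
            fields = line.split()
            if fields:  # ignore blank lines
                groups.add(fields[3])
    # Plain string sort, so 'g10' sorts before 'g2'.
    columns = sorted(groups)

    def write_row(dst, key, counts):
        dst.write(" ".join(key + [str(counts[c]) for c in columns]) + "\n")

    # Pass 2: stream the rows, summing counts until the key changes.
    with open(in_path) as src, open(out_path, "w") as dst:
        next(src, None)  # skip the header line again
        dst.write(" ".join(["id", "name", "info"] + columns) + "\n")
        key = None
        counts = dict.fromkeys(columns, 0)
        for line in src:
            fields = line.split()
            if not fields:
                continue
            if fields[:3] != key:
                if key is not None:
                    write_row(dst, key, counts)
                key = fields[:3]
                counts = dict.fromkeys(columns, 0)
            counts[fields[3]] += int(fields[4])
        if key is not None:
            write_row(dst, key, counts)  # flush the final id
```

Calling `pivot_groups('t.txt', 'wide.txt')` on the sample above should produce the header line followed by one row per id. If the ids are not contiguous, the file would need to be sorted by id first.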