Pivoting a large text file in Python

Time: 2014-09-24 14:32:51

Tags: python pandas pivot large-data

I have a large text file (over 10 GB) of unpivoted data, like this:

id   name   info   group   count
1    a1     aa1    g1      3
1    a1     aa1    g2      6
1    a1     aa1    g3      1
2    a2     aa2    g1      5
2    a2     aa2    g2      18
3    a3     aa3    g2      7
3    a3     aa3    g4      2

I would like to get a new file like this:

id   name   info   g1   g2   g3   g4
1    a1     aa1    3    6    1    0
2    a2     aa2    5    18   0    0
3    a3     aa3    0    7    0    2

Also, my data contains more than 100 possible groups, and I don't know the exact number.

Any ideas on how to solve this?

1 Answer:

Answer 0: (score: -2)

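In [1]: f = open('t.txt')

# first pass: collect the group names
In [2]: header = next(f)
In [3]: groups = dict()
In [4]: for line in f:
   ...:     tokens = line.split()
   ...:     groups[tokens[3]] = 0
   ...: 
In [5]: names = sorted(groups)
In [6]: names
Out[6]: ['g1', 'g2', 'g3', 'g4']

# second pass: stream through the data and print one row per id
In [7]: f.seek(0)
Out[7]: 0
In [8]: header = next(f)
In [9]: print(*header.split()[:3], *names)
id name info g1 g2 g3 g4

In [10]: current = None   # [id, name, info] of the rows being accumulated
In [11]: for line in f:
    ...:     tokens = line.split()
    ...:     if tokens[:3] != current:
    ...:         if current is not None:
    ...:             # a new id starts: print the finished one, reset the counts
    ...:             print(*current, *(groups[n] for n in names))
    ...:             for n in names:
    ...:                 groups[n] = 0
    ...:         current = tokens[:3]
    ...:     groups[tokens[3]] += int(tokens[4])
    ...: 
1 a1 aa1 3 6 1 0
2 a2 aa2 5 18 0 0

# the last id is still held in the counters, so print it after the loop
In [12]: print(*current, *(groups[n] for n in names))
3 a3 aa3 0 7 0 2

This relies on all rows for a given id being adjacent in the file, as they are in the sample above. Only the group names and the counts for the current id are held in memory, so the file size does not matter.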
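
The same two-pass idea can be packaged as a function that writes to a file instead of printing. The following is only a minimal sketch under a few assumptions: the function name `pivot_stream` and its `fill` parameter are made up for illustration, the input is a seekable text file whose columns are whitespace-separated, rows for one id are adjacent, and the output is tab-separated. Group columns are sorted as plain strings, so `g10` would come before `g2`.

```python
import io


def pivot_stream(src, dst, fill=0):
    """Pivot rows of (id, name, info, group, count) read from src into dst.

    Hypothetical helper: assumes src is seekable and that all rows
    belonging to one id are adjacent.
    """
    key_cols = next(src).split()[:3]  # header: id, name, info

    # First pass: collect every group name that occurs in the file.
    names = sorted({line.split()[3] for line in src if line.strip()})

    def write_row(key, counts):
        values = [str(counts.get(name, fill)) for name in names]
        dst.write("\t".join(key + values) + "\n")

    # Second pass: accumulate counts per id and emit one row per id.
    src.seek(0)
    next(src)  # skip the header again
    dst.write("\t".join(key_cols + names) + "\n")

    current, counts = None, {}
    for line in src:
        tokens = line.split()
        if not tokens:
            continue  # ignore blank lines
        key = tokens[:3]
        if key != current:
            if current is not None:
                write_row(current, counts)
            current, counts = key, {}
        counts[tokens[3]] = counts.get(tokens[3], 0) + int(tokens[4])

    if current is not None:  # flush the final id
        write_row(current, counts)
```

For the file in the question it could be called as `pivot_stream(open('t.txt'), open('out.txt', 'w'))`, where `out.txt` is a placeholder name; in real use both files are better opened in a `with` block so they are closed properly. If the rows for an id are not adjacent, the file would need to be sorted by id first, for example with an external sort.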