Question

I have a multi-column (13 columns), space-separated file (about 5 million+ rows) that looks like this:

W5 403  407 P Y 2 2 PR 22  PNIYR 22222 12.753 13.247
W5 404  408 N V 2 2 PR 22  PIYYR 22222 13.216 13.247
W3 274  276 E G 1 1 EG 11  EPG 121 6.492 6.492
W3 275  277 P R 2 1 PR 21  PGR 211 6.365 7.503
W3 276  278 G Y 1 1 GY 11  GRY 111 5.479 5.479
W3 46  49 G L 1 1 GY 11  GRY 111 5.176 5.176
W4 47  50 D K 1 1 DK 11  DILK 1111 4.893 5.278
W4 48  51 I K 1 1 IK 11  ILKK 1111 4.985 5.552

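and so on.

I am interested in two of these columns (columns 8 and 11) and want to count how many times each particular pair (column 8) occurs together with the string that follows it (column 11).

For example, with column 8 as the reference key: 'GY' : '111' occurs 2 times, 'PR' : '22222' occurs 2 times, 'DK' : '1111' occurs 1 time, and 'EG' : '121' occurs 1 time.

I have a basic dict-based implementation.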

countshash = {}
for l in bigtable:
    cont = l.split()
    if cont[7] not in countshash:
        countshash[cont[7]] = {}
    # test the same key that is initialized and incremented (index 10, not 11)
    if cont[10] not in countshash[cont[7]]:
        countshash[cont[7]][cont[10]] = 0
    countshash[cont[7]][cont[10]] += 1

I also have a simple awk-based count (which is super fast), but I would like to know an efficient, faster way to do this in Python. Thanks for your input.

Answer 1

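I'm not sure whether this will help with speed, but you are building a lot of defaultdict-like objects by hand, and I think you can make that more readable: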

from collections import defaultdict

countshash = defaultdict(lambda: defaultdict(int))

for l in bigtable:
    cont = l.split()
    countshash[cont[7]][cont[10]] += 1
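
As a quick sanity check, here is a minimal sketch that runs the same nested defaultdict over the eight sample rows from the question. `sample_rows` is a hypothetical stand-in for `bigtable`, and the rows are assumed to have exactly 13 whitespace-separated columns with no leading row number.

```python
from collections import defaultdict

# Stand-in for bigtable: the sample rows from the question
# (assumption: 13 columns per row, no row numbers in the data).
sample_rows = [
    "W5 403  407 P Y 2 2 PR 22  PNIYR 22222 12.753 13.247",
    "W5 404  408 N V 2 2 PR 22  PIYYR 22222 13.216 13.247",
    "W3 274  276 E G 1 1 EG 11  EPG 121 6.492 6.492",
    "W3 275  277 P R 2 1 PR 21  PGR 211 6.365 7.503",
    "W3 276  278 G Y 1 1 GY 11  GRY 111 5.479 5.479",
    "W3 46  49 G L 1 1 GY 11  GRY 111 5.176 5.176",
    "W4 47  50 D K 1 1 DK 11  DILK 1111 4.893 5.278",
    "W4 48  51 I K 1 1 IK 11  ILKK 1111 4.985 5.552",
]

# Outer key: column 8 (index 7); inner key: column 11 (index 10).
countshash = defaultdict(lambda: defaultdict(int))
for line in sample_rows:
    cont = line.split()
    countshash[cont[7]][cont[10]] += 1

# Print every (key, pattern) pair with its count, in sorted order.
for key, inner in sorted(countshash.items()):
    for pattern, n in sorted(inner.items()):
        print(key, pattern, n)
```

The counts for 'GY'/'111' and 'PR'/'22222' should match the figures given in the question.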

Answer 2

from collections import Counter
# 0-based indices 7 and 10 are columns 8 and 11, matching cont[7] and cont[10] in the question
Counter(tuple(row.split()[7:11:3]) for row in bigtable)

Using itemgetter is more flexible, and may be more efficient than slicing:

from collections import Counter
from operator import itemgetter
ig = itemgetter(7, 10)  # 0-based indices for columns 8 and 11
Counter(ig(row.split()) for row in bigtable)

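Using imap may make things a bit faster (imap exists only in Python 2; in Python 3 the built-in map is already lazy):

from itertools import imap  # Python 2 only; on Python 3 use the built-in map instead
Counter(imap(ig, imap(str.split, bigtable)))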
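
For readers on Python 3, the same Counter plus itemgetter idea could be sketched roughly as below. The helper name `count_pairs`, its parameters, and `sample_rows` are hypothetical, introduced only for illustration; the column indices 7 and 10 are the 0-based positions used in the question's own code.

```python
from collections import Counter
from operator import itemgetter


def count_pairs(lines, key_col=7, val_col=10):
    """Count (key, value) pairs taken from two 0-based columns.

    Hypothetical helper: `lines` is any iterable of whitespace-separated rows.
    """
    pick = itemgetter(key_col, val_col)
    # On Python 3 the built-in map is lazy, so rows are split one at a time
    # and no intermediate list of all split rows is built.
    return Counter(map(pick, map(str.split, lines)))


# Three of the sample rows from the question, used as a stand-in for bigtable.
sample_rows = [
    "W5 403  407 P Y 2 2 PR 22  PNIYR 22222 12.753 13.247",
    "W5 404  408 N V 2 2 PR 22  PIYYR 22222 13.216 13.247",
    "W3 274  276 E G 1 1 EG 11  EPG 121 6.492 6.492",
]

counts = count_pairs(sample_rows)
print(counts.most_common(1))
```

Because an open file object is itself an iterable of lines, passing it to such a helper directly should avoid loading all 5 million rows into memory first.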

Answer 3

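You are doing a double lookup. You could do countshash[(cont[7],cont[10])]+=1 instead, which may be faster, but that depends on how Python implements it. The memory footprint should be slightly larger.

Simply: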

from collections import defaultdict

countshash = defaultdict(int)
for l in bigtable:
    cont = l.split()
    # one lookup with a (column 8, column 11) tuple as the key
    countshash[(cont[7], cont[10])] += 1
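
Whether the single tuple-keyed lookup actually beats the nested lookup is best measured rather than guessed. A rough sketch of such a comparison with `timeit` follows; the function names `nested` and `flat` and the synthetic `rows` list are assumptions for illustration, and the timings will vary by machine and Python version.

```python
import timeit
from collections import defaultdict

# Synthetic input: two sample rows from the question repeated many times
# (an assumption standing in for the real 5-million-row file).
rows = [
    "W3 276  278 G Y 1 1 GY 11  GRY 111 5.479 5.479",
    "W5 403  407 P Y 2 2 PR 22  PNIYR 22222 12.753 13.247",
] * 1000


def nested(lines):
    """Dict of dicts: two lookups per row."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        cont = line.split()
        counts[cont[7]][cont[10]] += 1
    return counts


def flat(lines):
    """Single dict keyed by a (column 8, column 11) tuple: one lookup per row."""
    counts = defaultdict(int)
    for line in lines:
        cont = line.split()
        counts[cont[7], cont[10]] += 1
    return counts


nested_counts = nested(rows)
flat_counts = flat(rows)

# Both layouts should hold exactly the same counts.
same = all(nested_counts[k][v] == n for (k, v), n in flat_counts.items())
print("same counts:", same)

# Time each variant; no particular winner is assumed here.
print("nested:", timeit.timeit(lambda: nested(rows), number=5))
print("flat:  ", timeit.timeit(lambda: flat(rows), number=5))
```

On real data, most of the time is likely spent in `split()` rather than in the dictionary lookups, so the difference between the two layouts may turn out to be small.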

Answer 4

from collections import defaultdict

countshash = defaultdict(int)
for l in bigtable:
    cont = l.split()
    countshash[cont[7], cont[10]] += 1

Fastest way to count three items with one of them as the reference (using Python)?

4 answers: