Question

I have a huge file with thousands of lines that look like this:

`C509.TCGA-78-7159-10A-01D-2036-08.1-C509.  1   0   0   1   0   0
 C509.TCGA-78-7159-10A-01D-2036-08.1-C509.  0   1   1   0   1   1`

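When the first column of two rows matches, I want to add the second column of one row to the second column of the other, the third to the third, and so on, without using pandas. Given the size of the file, Python is probably a better fit than awk.

The output should be: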

C509.TCGA-78-7159-10A-01D-2036-08.1-C509. 1 1 1 1 1 1

Thanks for your help :)

Answer 1

If you load the data into Python as a list of lists, you can do the following:

from operator import add

data = [['C509.TCGA-78-7159-10A-01D-2036-08.1-C509.', 1, 0, 0, 1, 0, 0],
        ['C509.TCGA-78-7159-10A-01D-2036-08.1-C510.', 0, 1, 1, 0, 1, 1],
        ['C509.TCGA-78-7159-10A-01D-2036-08.1-C509.', 1, 0, 0, 1, 1, 0],
        ['C509.TCGA-78-7159-10A-01D-2036-08.1-C509.', 1, 0, 0, 1, 0, 2]]

dic = {}
for row in data:
    key, values = row[0], row[1:]
    if key not in dic:
        # first time this key is seen: start from its own values
        dic[key] = values
    else:
        # add element-wise to the running totals for this key
        dic[key] = list(map(add, dic[key], values))

This gives you a dictionary that maps each unique first-column value to the sums of the other columns:

{'C509.TCGA-78-7159-10A-01D-2036-08.1-C509.': [3, 0, 0, 3, 1, 2],
 'C509.TCGA-78-7159-10A-01D-2036-08.1-C510.': [0, 1, 1, 0, 1, 1]}
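The answer above assumes the data is already a list of lists. A minimal sketch of how the text lines might be converted into that shape and then summed is shown here; the helper names `load_rows` and `sum_by_key` are hypothetical, and the sketch assumes every field after the first is an integer and that rows are separated by whitespace.

```python
from operator import add


def load_rows(lines):
    """Turn whitespace-separated text lines into [key, int, int, ...] lists."""
    rows = []
    for line in lines:
        fields = line.split()  # ignores leading/trailing whitespace and newlines
        if not fields:
            continue  # skip blank lines
        rows.append([fields[0]] + [int(x) for x in fields[1:]])
    return rows


def sum_by_key(rows):
    """Sum the numeric columns of all rows that share the same first field."""
    totals = {}
    for row in rows:
        key, values = row[0], row[1:]
        if key not in totals:
            totals[key] = values
        else:
            totals[key] = list(map(add, totals[key], values))
    return totals


# The two sample lines from the question (the second has a leading space).
sample = [
    "C509.TCGA-78-7159-10A-01D-2036-08.1-C509.  1   0   0   1   0   0",
    " C509.TCGA-78-7159-10A-01D-2036-08.1-C509.  0   1   1   0   1   1",
]
result = sum_by_key(load_rows(sample))
print(result)
```

Because `load_rows` accepts any iterable of strings, an open file object could be passed in place of `sample`. Note that it builds the full list in memory first, which may not be desirable for a very large file.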

Answer 2

You can use something like this:

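res = dict()

with open("mydata.txt") as f:
    for line in f:
        # split() with no arguments ignores leading/trailing whitespace,
        # so the newline does not produce an extra empty field
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, cols = fields[0], fields[1:]
        totals = res.setdefault(key, [0] * len(cols))
        # add every column to the running total for this key
        for n, value in enumerate(cols):
            totals[n] += int(value)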

For the input in the question, the output is:

print(res)
{'C509.TCGA-78-7159-10A-01D-2036-08.1-C509.': [1, 1, 1, 1, 1, 1]}
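The question asks for one line per key rather than a dictionary. A small sketch of how a result dictionary like `res` could be turned back into that layout follows; `format_totals` is a hypothetical helper, and single-space separators are an assumption based on the expected output shown in the question.

```python
def format_totals(totals):
    """Yield one 'key v1 v2 ...' line per key, in the dict's insertion order."""
    for key, values in totals.items():
        yield " ".join([key] + [str(v) for v in values])


# Same shape as the `res` dictionary printed above.
res = {"C509.TCGA-78-7159-10A-01D-2036-08.1-C509.": [1, 1, 1, 1, 1, 1]}
lines = list(format_totals(res))
for line in lines:
    print(line)
```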

Python: sum columns when rows match
