Question

I have a text file like this:

chrX    7970000    8670000   3  2   7   7   RPS6KA6   4
chrX    7970000    8670000   3  2   7   7     SATL1   3
chrX    7970000    8670000   3  2   7   7   SH3BGRL   4
chrX    7970000    8670000   3  2   7   7      VCX2   1
chrX   86580000   86980000   1  1   1   5     KLHL4   2
chrX   87370000   88620000   4  4  11  11    CPXCR1   2
chrX   87370000   88620000   4  4  11  11     FAM9A   2
chrX   89050000   91020000  11  6  10  13     FAM9B   3
chrX   89050000   91020000  11  6  10  13    PABPC5   2

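I want to count the number of times each line is repeated (based only on the 1st, 2nd and 3rd columns). The output will have 5 columns. The first 3 columns will be the same as in the input (with each line appearing only once), but the 4th column will hold multiple strings on the same line (these strings come from the 8th column of the original file). The 5th column is the number of times the first 3 columns are repeated in the original file.

In short: columns 4, 5, 6, 7 and 9 of the input file are useless for the output file. We should count the number of lines in which the first 3 columns are the same, so that in the output file each such line appears only once (but the first 3 columns would be the same as in the input file). The 5th column is the number of times the line is repeated. The 4th column of the output holds the strings from the 8th column of all the repeated lines. In the expected output, this line is repeated 4 times: chrX 7970000 8670000. So the 5th column is 4, and the 4th column is: RPS6KA6,SATL1,SH3BGRL,VCX2. As you can see, the strings in the 4th column are comma separated.

Here is the expected output:

chrX    7970000 8670000 RPS6KA6,SATL1,SH3BGRL,VCX2  4
chrX    86580000    86980000    KLHL4   1
chrX    87370000    88620000    CPXCR1,FAM9A    2
chrX    89050000    91020000    FAM9B,PABPC5    2

I tried to do this in Python and wrote some code, but it does not return what I want. Do you know how to fix it?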

Answer 1

An alternative solution:

from collections import defaultdict
summary = defaultdict(list)

# Input and collate
with open('myfile.txt', 'r') as fp:
    for line in fp:
        items = line.strip().split()
        key, data = (items[0], items[1], items[2]), items[7]
        summary[key].append(data)

# Output
for keys, entries in summary.items():
    print('{keys}\t{entries} {count}'.format(
          keys=' '.join(keys),
          entries=','.join(entries), 
          count=len(entries) ))

In Python 2.7 this produces the output:

chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 89050000 91020000  FAM9B,PABPC5 2
chrX 87370000 88620000  CPXCR1,FAM9A 2
chrX 86580000 86980000  KLHL4 1

With Python 3.6, the output is:

chrX 7970000 8670000    RPS6KA6,SATL1,SH3BGRL,VCX2 4
chrX 86580000 86980000  KLHL4 1
chrX 87370000 88620000  CPXCR1,FAM9A 2
chrX 89050000 91020000  FAM9B,PABPC5 2

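The output order differs between the two Python versions because dictionaries (and, by extension, defaultdicts) in Python 3.6 preserve the order in which keys were inserted. This is an implementation detail of CPython 3.6 and is guaranteed by the language from Python 3.7 onwards. It is not clear from your description whether the ordering matters.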
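If the ordering does matter, one way to make it independent of the Python version is to sort the keys before printing. This is only a minimal sketch: the sample data and the helper name sorted_rows are made up for illustration, and it assumes the rows have already been collated into a dictionary like summary above.

```python
from collections import defaultdict

# Hypothetical sample of already-collated data: (chrom, start, end) -> gene names.
summary = defaultdict(list)
summary[('chrX', '86580000', '86980000')].append('KLHL4')
summary[('chrX', '7970000', '8670000')].extend(['RPS6KA6', 'SATL1'])


def sorted_rows(summary):
    """Return tab-separated output lines ordered by chromosome, start and end."""
    def sort_key(key):
        chrom, start, end = key
        # Compare coordinates as integers, so that e.g. '9000000'
        # sorts before '86580000' (as strings it would sort after).
        return (chrom, int(start), int(end))

    return ['{}\t{}\t{}'.format('\t'.join(key),
                                ','.join(summary[key]),
                                len(summary[key]))
            for key in sorted(summary, key=sort_key)]


for row in sorted_rows(summary):
    print(row)
```

Sorting on integer coordinates rather than on the raw strings is a design choice here; whether chromosome names should also be ordered in some special way depends on the data.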

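I think the main reason your version does not work is that your expression infile[0,1,2,7, count] does not do what you think it does.

It seems you want to extract columns 0, 1, 2 and 7 from the line. However, this is not a valid way to index in Python, and in any case Python knows nothing about columns in your data; it only knows about characters.

In my version I use the 'split' method on each line. It separates the line wherever there are spaces or tabs, i.e. it splits the data into columns.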
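The difference between indexing characters and indexing columns can be seen on a single line. This is a small sketch that assumes the line is handled as a plain string; the variable names are only illustrative.

```python
line = 'chrX    7970000    8670000   3  2   7   7   RPS6KA6   4\n'

# Indexing a string picks out a single character, not a column.
first_char = line[0]

# A comma-separated index is a tuple, which strings do not accept.
try:
    line[0, 1, 2, 7]
    tuple_index_ok = True
except TypeError:
    tuple_index_ok = False

# split() with no arguments breaks the line on any run of whitespace
# and drops the trailing newline, giving one item per column.
items = line.split()
key, gene = tuple(items[:3]), items[7]
print(key, gene)
```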

Answer 2

This should do what you want:

from collections import defaultdict # 1

with open('file.txt') as infile:
    lines = [line.rstrip().split() for line in infile] # 2

counter = defaultdict(list) # 3
for line in lines:
    counter[(line[0], line[1], line[2])].append(line[7]) # 4

for key, value in counter.items(): # 5
    print('{} {} {}'.format(' '.join(key), ','.join(value), len(value))) # 6

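Explanation:

1. We use a handy library that gives us a dictionary with default values
2. Read the whole input file, strip the newline at the end of each line and split it into parts (on whitespace)
3. Make a dictionary whose default value for any key is an empty list
4. Iterate over the lines and fill the dictionary
   1. Columns 1-3 are the key
   2. For each character sequence in column 8, we append it to the list (this would be more complicated if we were not using defaultdict with list)
5. Iterate over the key/value pairs of the dictionary
6. Print the output, joining the data structures into the desired format.

Hope this helps.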
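To illustrate the remark about defaultdict in step 4, here is roughly what the same loop could look like with a plain dict. It is only a sketch: the rows are a hand-written sample of already-split lines, not read from a file.

```python
# Hypothetical sample of already-split input lines.
rows = [
    ['chrX', '87370000', '88620000', '4', '4', '11', '11', 'CPXCR1', '2'],
    ['chrX', '87370000', '88620000', '4', '4', '11', '11', 'FAM9A', '2'],
    ['chrX', '86580000', '86980000', '1', '1', '1', '5', 'KLHL4', '2'],
]

counter = {}
for row in rows:
    key = (row[0], row[1], row[2])
    # A plain dict raises KeyError for a missing key,
    # so the empty list has to be created explicitly first.
    if key not in counter:
        counter[key] = []
    counter[key].append(row[7])

print(counter)
```

With defaultdict(list) the membership check disappears, because the empty list is created automatically the first time a key is accessed.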

Answer 3

This is a great opportunity to use pandas. You can open and process the file like this:

import pandas as pd
# open file (whitespace-separated columns, no header row)
df = pd.read_csv('myfile.txt', sep=r'\s+', header=None)
# group and apply functions
df = df.groupby([0,1,2])[7].agg([('count', 'size'),
                                 ('genes', lambda col: ', '.join(col))
                                ]).reset_index()
# rename columns
df = df.rename({0: 'chromosome', 1: 'start_region', 2: 'end_region'}, axis=1)
# save new file
df.to_csv('newfile.txt', sep='\t', index=False, header=True)

Reading the file creates a DataFrame that looks like this:

      0         1         2   3  4   5   6        7  8
0  chrX   7970000   8670000   3  2   7   7  RPS6KA6  4
1  chrX   7970000   8670000   3  2   7   7    SATL1  3
2  chrX   7970000   8670000   3  2   7   7  SH3BGRL  4
3  chrX   7970000   8670000   3  2   7   7     VCX2  1
4  chrX  86580000  86980000   1  1   1   5    KLHL4  2
5  chrX  87370000  88620000   4  4  11  11   CPXCR1  2
6  chrX  87370000  88620000   4  4  11  11    FAM9A  2
7  chrX  89050000  91020000  11  6  10  13    FAM9B  3
8  chrX  89050000  91020000  11  6  10  13   PABPC5  2

Now, using built-in functions, we can groupby on columns [0, 1, 2] and apply functions to the groups, which gives:

      0         1         2  count                          genes
0  chrX   7970000   8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
1  chrX  86580000  86980000      1                          KLHL4
2  chrX  87370000  88620000      2                  CPXCR1, FAM9A
3  chrX  89050000  91020000      2                  FAM9B, PABPC5

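This groups the data and adds the columns we are interested in:

('count', 'size') creates the column count using the function size
('genes', lambda col: ', '.join(col)) creates the column genes using a lambda function, which simply joins the values of the grouped column together.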
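For readers unfamiliar with group-and-aggregate, the same idea can be sketched with the standard library alone. This is only a conceptual analogy and not how pandas works internally; the sample rows and the helper name region are assumptions made for the illustration.

```python
from itertools import groupby

# Hypothetical sample of already-split input lines.
rows = [
    ['chrX', '7970000', '8670000', '3', '2', '7', '7', 'RPS6KA6', '4'],
    ['chrX', '7970000', '8670000', '3', '2', '7', '7', 'SATL1', '3'],
    ['chrX', '86580000', '86980000', '1', '1', '1', '5', 'KLHL4', '2'],
]


def region(row):
    """Grouping key: chromosome plus numeric start and end."""
    return (row[0], int(row[1]), int(row[2]))


summary = []
# itertools.groupby only merges adjacent rows, so sort by the key first.
for key, group in groupby(sorted(rows, key=region), key=region):
    genes = [row[7] for row in group]
    # One output record per group: key columns, then count, then joined genes.
    summary.append(key + (len(genes), ', '.join(genes)))

for record in summary:
    print(record)
```

Here len(genes) plays the role of size and ', '.join(genes) the role of the lambda.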

This is what the final file looks like:

chromosome  start_region  end_region  count                          genes
      chrX       7970000     8670000      4  RPS6KA6, SATL1, SH3BGRL, VCX2
      chrX      86580000    86980000      1                          KLHL4
      chrX      87370000    88620000      2                  CPXCR1, FAM9A
      chrX      89050000    91020000      2                  FAM9B, PABPC5

If you have any questions, see the pandas tag.

Summarizing the contents of a text file
