Question

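I have another data processing problem. I have a tab-delimited .gtf file from which I need to extract certain features. Previously I only needed to pull out the gene ID, POS1 and POS2 for every "exon" TYPE in each gene, which was simpler. Now I need to do the same thing, but first I have to find POS1 and POS2 of each exon relative to its position within the gene. Right now the POS1 and POS2 columns are numbered by the position of the TYPE on the whole genome (which is why the numbers are so high). There is a further complication: if the strand is -, everything is reversed. If you look at PITG_00002, you can see that the stop codon appears to come before the start codon. That is because everything is numbered relative to the + (template) strand. Here is a sample of the data table: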

GENE_ID     TYPE        POS1    POS2    STRAND
PITG_00003  start_codon 38775   38777   +   0
PITG_00003  stop_codon  39069   39071   +   0
PITG_00003  exon        38775   39071   +   .
PITG_00003  CDS         38775   39068   +   0
PITG_00004  start_codon 39526   39528   +   0
PITG_00004  stop_codon  41492   41494   +   0
PITG_00004  exon        39526   40416   +   .
PITG_00004  CDS         39526   40416   +   0
PITG_00004  exon        40486   40771   +   .
PITG_00004  CDS         40486   40771   +   0
PITG_00004  exon        40827   41494   +   .
PITG_00004  CDS         40827   41491   +   2
PITG_00002  start_codon 10520   10522   -   0
PITG_00002  stop_codon  10097   10099   -   0
PITG_00002  exon        10474   10522   -   .
PITG_00002  CDS         10474   10522   -   0
PITG_00002  exon        10171   10433   -   .
PITG_00002  CDS         10171   10433   -   2
PITG_00002  exon        10097   10114   -   .
PITG_00002  CDS         10100   10114   -   0

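So for each gene I need the numbering to start at 1 relative to the position of the "start_codon" TYPE. Unfortunately, for genes listed on the - STRAND (e.g. PITG_00002), the numbering runs backwards. In those cases the numbering needs to start at 1 relative to POS2 of the start_codon and end at POS1 of the exon.

So for each exon I need a new POS1 and POS2, which I will call POSA and POSB.

To get POSA for each exon, I would do this: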

POS1 of "exon" - POS1 of "start_codon" + 1 = POSA

To get POSB for each exon, I would do this:

POS2 of "exon" - POS1 of "start_codon" + 1 = POSB

Using the first exon of PITG_00004 as an example:

POSA = 39526 - 39526 + 1 = 1
POSB = 40416 - 39526 + 1 = 891

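Then I would do the same for every exon in each gene, using that gene's start_codon position to reset the numbering. The exception is the minus strand, where I would have to do this instead:

To get POSA for each exon, I would do this: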

POS2 of "start_codon" - POS2 of "exon" + 1 = POSA

To get POSB for each exon, I would do this:

POS1 of "start_codon" - POS1 of "exon" + 1 = POSB

In the end I would like to get this:

PITG_00002 exon 1 49
PITG_00002 exon 90 352
PITG_00002 exon 409 426
PITG_00003 exon 1 297
PITG_00004 exon 1 891
PITG_00004 exon 961 1246
PITG_00004 exon 1302 1969

I'm not sure how to do it one way for the + strand and another way for the - strand. I have been using Python a lot lately, but I could also use Perl.

Answer 1

A Perl solution. It uses a hash to store the information about each gene. The @idxs array is used to avoid repeating the formula.

#!/usr/bin/perl
use warnings;
use strict;

my %hash;
<>;                   # Skip header.
while (<>) {
    my ($id, $type, $pos1, $pos2, $strand, undef) = split;
    next unless defined $type;          # Ignore blank lines.
    # Dispatch on the feature type; CDS rows are not needed.
    if ($type eq 'start_codon') {
        $hash{$id}{start}  = [$pos1, $pos2];
        $hash{$id}{strand} = $strand;
    }
    elsif ($type eq 'stop_codon') {
        $hash{$id}{stop} = [$pos1, $pos2];
    }
    elsif ($type eq 'exon') {
        push @{ $hash{$id}{exons} }, [$pos1, $pos2];
    }
}

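for my $id (sort keys %hash) {
    # Index 0 is POS1, index 1 is POS2; the - strand swaps which end is compared first.
    my @idxs = '+' eq $hash{$id}{strand} ? (0, 1) : (1, 0);
    for my $exon (@{ $hash{$id}{exons} }) {
        my $posa = 1 + abs $hash{$id}{start}[$idxs[0]] - $exon->[$idxs[0]];
        # posb is measured from the other end of the start codon, hence 3 instead of 1.
        # This assumes the start codon spans exactly three contiguous bases.
        my $posb = 3 + abs $hash{$id}{start}[$idxs[1]] - $exon->[$idxs[1]];
        print "$id exon $posa $posb\n";
    }
}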
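
Since the question mentions Python as well, the same bookkeeping can be sketched there almost line for line. This is only an illustration, not part of the Perl answer: the function name `relative_exons`, the `SAMPLE` string and the whitespace-separated layout are assumptions. It applies the strand-specific formulas directly (from the first base of the start codon on +, from its last base on -), so it does not rely on the start codon being three bases long.

```python
from collections import defaultdict

# A shortened copy of the sample data from the question (assumed layout).
SAMPLE = """\
GENE_ID TYPE POS1 POS2 STRAND
PITG_00004 start_codon 39526 39528 + 0
PITG_00004 exon 39526 40416 + .
PITG_00004 exon 40486 40771 + .
PITG_00004 exon 40827 41494 + .
PITG_00002 start_codon 10520 10522 - 0
PITG_00002 exon 10474 10522 - .
PITG_00002 exon 10171 10433 - .
PITG_00002 exon 10097 10114 - .
"""

def relative_exons(lines):
    """Yield (gene_id, posa, posb) for every exon, numbered from the start codon."""
    starts = {}                # gene_id -> (pos1, pos2, strand) of the start codon
    exons = defaultdict(list)  # gene_id -> [(pos1, pos2), ...] in file order
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[2].isdigit():
            continue           # skip the header and blank lines
        gene_id, kind, pos1, pos2, strand = fields[:5]
        if kind == "start_codon":
            starts[gene_id] = (int(pos1), int(pos2), strand)
        elif kind == "exon":
            exons[gene_id].append((int(pos1), int(pos2)))
    for gene_id in sorted(exons):
        # Raises KeyError if a gene has exons but no start_codon row.
        s1, s2, strand = starts[gene_id]
        for e1, e2 in exons[gene_id]:
            if strand == "+":
                yield gene_id, e1 - s1 + 1, e2 - s1 + 1
            else:
                yield gene_id, s2 - e2 + 1, s2 - e1 + 1

for gene_id, posa, posb in relative_exons(SAMPLE.splitlines()):
    print(gene_id, "exon", posa, posb)
```

To run it on a real file, something like `relative_exons(open(path))` should work the same way, since the function only needs an iterable of lines.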

Answer 2

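OK, here is a solution to your problem (at least as far as I understand it). It is based on the pandas library (http://pandas.pydata.org/), which is currently the gold standard for data analysis in Python.

First, load your data: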

import pandas as pd

data = pd.read_csv('genetest.csv', sep='\t',
                   converters={'STRAND': lambda s: s[0]})

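The converter just strips the superfluous characters from the STRAND column, leaving only + or -.

Now use the groupby function to split your sequences by strand direction and gene name: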

groups = data.groupby(['STRAND', 'GENE_ID'])

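This splits your dataset into pieces that share the same strand direction and gene name, so you can process each of them separately. We iterate over them as we would over dictionary items (a list of key, value pairs) and operate on each one.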

corrected = []
for (direction, gene_name), group in groups:
    print(direction, gene_name)
    # work on a copy so the new columns can be assigned safely
    group = group.copy()
    # index label of the start_codon row, whose position is subtracted from the others
    start_idx = group.index[group.TYPE == 'start_codon'][0]
    # normalize the positions and store them in the group
    if direction == '+':
        group['POSA'] = 1 + group.POS1 - group.loc[start_idx, 'POS1']
        group['POSB'] = 1 + group.POS2 - group.loc[start_idx, 'POS1']
    else:
        group['POSA'] = 1 - group.POS2 + group.loc[start_idx, 'POS2']
        group['POSB'] = 1 - group.POS1 + group.loc[start_idx, 'POS2']
    print(group)
    # collect the processed group
    corrected.append(group)
# join the groups to obtain the whole dataset with POSA and POSB
new_data = pd.concat(corrected)
print(new_data)

This is what you obtain:

    GENE_ID     TYPE         POS1   POS2   STRAND  POSA  POSB
0   PITG_00003  start_codon  38775  38777  +       1     3
1   PITG_00003  stop_codon   39069  39071  +       295   297
2   PITG_00003  exon         38775  39071  +       1     297
3   PITG_00003  CDS          38775  39068  +       1     294
4   PITG_00004  start_codon  39526  39528  +       1     3
5   PITG_00004  stop_codon   41492  41494  +       1967  1969
6   PITG_00004  exon         39526  40416  +       1     891
7   PITG_00004  CDS          39526  40416  +       1     891
8   PITG_00004  exon         40486  40771  +       961   1246
9   PITG_00004  CDS          40486  40771  +       961   1246
10  PITG_00004  exon         40827  41494  +       1302  1969
11  PITG_00004  CDS          40827  41491  +       1302  1966
12  PITG_00002  start_codon  10520  10522  -       1     3
13  PITG_00002  stop_codon   10097  10099  -       424   426
14  PITG_00002  exon         10474  10522  -       1     49
15  PITG_00002  CDS          10474  10522  -       1     49
16  PITG_00002  exon         10171  10433  -       90    352
17  PITG_00002  CDS          10171  10433  -       90    352
18  PITG_00002  exon         10097  10114  -       409   426
19  PITG_00002  CDS          10100  10114  -       409   423

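By the way, you wrote the wrong distance correction for the minus strand in your question. It should be

POS2 of "start_codon" - POS1 of "exon" + 1 = POSB

for the reversed strand to make geometric sense (and to get the values you posted).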
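
As a quick check, the minus-strand branch of the code above computes POSA and POSB from POS2 of the start codon. The sketch below repeats that arithmetic for the three exons of PITG_00002; the function name `minus_strand_positions` is made up for illustration.

```python
def minus_strand_positions(start_pos2, exon_pos1, exon_pos2):
    """Number a minus-strand exon from the last base (POS2) of the start codon."""
    posa = start_pos2 - exon_pos2 + 1
    posb = start_pos2 - exon_pos1 + 1
    return posa, posb

# PITG_00002: the start codon spans 10520-10522 on the minus strand.
for exon_pos1, exon_pos2 in [(10474, 10522), (10171, 10433), (10097, 10114)]:
    print(minus_strand_positions(10522, exon_pos1, exon_pos2))
```

This gives (1, 49), (90, 352) and (409, 426), which are the values requested in the question.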

Renumbering two columns of tab-delimited data relative to the contents of a third column
