Editing the data in the last column

Date: 2012-10-08 09:49:32

Tags: python perl bash shell

I have a tab-delimited data file that looks like this:

chr1    38045559    38046059    chr1:38045559-38046559_NM_001142726_C1orf122_+,chr1:38045559-38046559_NM_198446_C1orf122_+,chr1:38045952-38046952_NM_024640_YRDC_-
chr1    205291045   205291545   chr1:205290545-205291545_NM_018566_YOD1_-
chr1    1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1    1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1    1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_-,chr1:1699657-1700657_NM_001198993_NADK_-
chr1    1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1    1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1    2066155 2066655 
chr1    2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1    2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_-,chr1:2312573-2313573_NM_007033_RER1_+

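where:

  • the first three columns are coordinates, and
  • the last column contains a set of zero or more annotation records
    • annotation records are separated by commas
    • fields within an annotation record are separated by underscores or colons

What I want is to keep the first three columns as they are and extract the fifth field from each annotation record. For example, for line 1 the output should look like this: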

chr1    38045559    38046059   C1orf122
                               C1orf122
                               YRDC
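
To pin down what "fifth field" means here, a minimal Python sketch of how one cell of the last column breaks down. The function name fifth_fields is made up for illustration, and the sketch assumes every non-empty record has at least five fields once it is split on underscores and colons.

```python
import re

def fifth_fields(cell):
    """Return the fifth field of every comma-separated record in one cell."""
    names = []
    for record in cell.split(","):
        if not record:           # an empty cell yields one empty record; skip it
            continue
        fields = re.split(r"[_:]", record)
        names.append(fields[4])  # indices start at 0, so index 4 is the fifth field
    return names

cell = ("chr1:38045559-38046559_NM_001142726_C1orf122_+,"
        "chr1:38045559-38046559_NM_198446_C1orf122_+,"
        "chr1:38045952-38046952_NM_024640_YRDC_-")
print(fifth_fields(cell))
```

For the cell from line 1 this gives the three names C1orf122, C1orf122 and YRDC; an empty cell gives an empty list.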

What I have done so far is split the data in the last column on the comma ',' using

tr ',' '\t' <input >temp1

What I have now is a file that looks like this:

chr1    38045559    38046059    chr1:38045559-38046559_NM_001142726_C1orf122_+  chr1:38045559-38046559_NM_198446_C1orf122_+ chr1:38045952-38046952_NM_024640_YRDC_-
chr1    205291045   205291545   chr1:205290545-205291545_NM_018566_YOD1_-
chr1    1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1    1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1    1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_-   chr1:1699657-1700657_NM_001198993_NADK_-
chr1    1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1    1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1    2066155 2066655 
chr1    2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1    2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_-  chr1:2312573-2313573_NM_007033_RER1_+

Now I need some of your expertise to get from here to the desired output format.

Please guide me on how to get the desired output in python / perl / shell.

3 Answers:

Answer 0: (score: 1)

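I would consider using a script file, since this is a prime candidate for tweaking and reuse, and it is at least moderately complex. A script also makes it easier to use suitable modules in your code. Text::CSV will read your csv file safely, and Text::ParseWords will handle your nested fields.

The following script is for demonstration. You can change the file handle *DATA to *ARGV to make the script parse a file given as an argument, e.g.: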

perl script.pl file.csv > output.txt

Code:

use strict;
use warnings;
use Text::CSV;
use Text::ParseWords;

my $csv = Text::CSV->new({                     # create csv object
        sep_char => "\t",                      # delimiter is tab
    });  

while(my $row = $csv->getline(*DATA)) {        # read from file handle
    my @anno = quotewords(',', 0, $row->[-1]); # get list of fields
    @anno = "" unless @anno;                   # avoid empty list
    for (@anno) {                              # for each field
        my @inner = quotewords('[:_]', 0, $_);    # get inner fields
        my $anno = $inner[-2] // "";              # take second last
        print join "\t", @$row[0 .. 2], $anno;
        print $/;
        $_ = "" for @$row;                     # clear primary row once printed
    }
}
__DATA__
chr1    38045559    38046059    chr1:38045559-38046559_NM_001142726_C1orf122_+,chr1:38045559-38046559_NM_198446_C1orf122_+,chr1:38045952-38046952_NM_024640_YRDC_-
chr1    205291045   205291545   chr1:205290545-205291545_NM_018566_YOD1_-
chr1    1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1    1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1    1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_-,chr1:1699657-1700657_NM_001198993_NADK_-
chr1    1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1    1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1    2066155 2066655 
chr1    2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1    2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_-,chr1:2312573-2313573_NM_007033_RER1_+
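
Since the question also asks about Python, here is a rough sketch of the same idea using the standard csv module. It is not a translation of the Perl script above; the function name expand_rows and the padding of short rows are assumptions made for this illustration. Like the Perl version, it takes the second-to-last inner field and blanks the coordinates after the first record of each line.

```python
import csv
import re

def expand_rows(lines):
    """Yield one output row (a list of four strings) per annotation record."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue                        # skip blank lines
        row += [""] * (4 - len(row))        # assumption: pad rows lacking the last column
        coords = row[:3]
        for record in row[3].split(","):    # "" splits to [""], so an empty cell still prints once
            inner = re.split(r"[:_]", record)
            name = inner[-2] if len(inner) >= 2 else ""  # second-to-last field
            yield coords + [name]
            coords = ["", "", ""]           # blank the coordinates after the first record

sample = [
    "chr1\t2312573\t2313353\tchr1:2312353-2313353_NM_024848_MORN1_-,"
    "chr1:2312573-2313573_NM_007033_RER1_+",
    "chr1\t2066155\t2066655\t",
]
for out in expand_rows(sample):
    print("\t".join(out))
```

To read a real file, pass an open file object instead of the sample list, for example expand_rows(open("file.csv", newline="")).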

Answer 1: (score: 1)

I suggest this solution, which expects the input file as an argument on the command line

use strict;
use warnings;

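while (<>) {
  chomp;
  my @columns = split "\t";              # tab-separated input columns

  print join "\t", @columns[0, 1, 2];    # coordinates, printed as they are

  unless ($columns[3]) {                 # no annotation: finish the line
    print "\n";
    next;
  }

  my @records = split /,/, $columns[3];  # one record per comma
  my $tabs = 1;                          # first name follows the coordinates
  for (@records) {
    my @notes = split /[_:]/;            # fields within one record
    print "\t" x $tabs;
    print $notes[4], "\n";               # fifth field
    $tabs = 4;                           # later names go on lines of their own
  }
}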

Output

chr1    38045559    38046059    C1orf122
                C1orf122
                YRDC
chr1    205291045   205291545   YOD1
chr1    1499717 1500625 SSU72
chr1    1679941 1680441 NADK
chr1    1699769 1700657 NADK
                NADK
chr1    1701368 1701868 NADK
chr1    1812386 1812886 GNB1
chr1    2066155 2066655
chr1    2149493 2149993 SKI
chr1    2312573 2313353 MORN1
                RER1

Note that the misalignment is a result of separating variable-length fields with tabs
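
The effect is easy to reproduce. The small Python sketch below expands tabs the way a terminal would, assuming tab stops every four columns (the width the listing above appears to have been rendered with; your terminal may differ).

```python
rows = [
    "chr1\t38045559\t38046059\tC1orf122",
    "chr1\t1499717\t1500625\tSSU72",
]
# A tab only advances to the next tab stop, so an 8-character value and a
# 7-character value end up followed by different amounts of padding.
expanded = [row.expandtabs(4) for row in rows]
for line in expanded:
    print(line)
```

The 8-digit coordinates push the last column of the first row further right than in the second row, even though both rows contain exactly three tabs.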

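Update

This version accumulates the output in an array and works out the maximum width of each column, so that the data can be displayed with appropriate fixed field widths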

use strict;
use warnings;

my @output;

while (<>) {
  chomp;
  my @columns = split "\t";
  my @outrec = @columns[0,1,2];

  if ($columns[3]) {

    my @records = split /,/, $columns[3];
    for (@records) {
      my @notes = split /[_:]/;
      $outrec[3] = $notes[4];
      push @output, [ @outrec ];
      @outrec = ();
    }
  }
  else {
      push @output, \@outrec;
  }
}

my @sizes;
for (@output) {
  for my $i (0..3) {
    my $length = length($_->[$i] // '');
    $sizes[$i] = $length unless $sizes[$i] and $sizes[$i] > $length;
  }
}

for my $outrec (@output) {
  printf "%-*s %-*s %-*s %-*s\n", map { $sizes[$_], $outrec->[$_] // ''} 0..3;
}

Output

chr1 38045559  38046059  C1orf122
                         C1orf122
                         YRDC    
chr1 205291045 205291545 YOD1    
chr1 1499717   1500625   SSU72   
chr1 1679941   1680441   NADK    
chr1 1699769   1700657   NADK    
                         NADK    
chr1 1701368   1701868   NADK    
chr1 1812386   1812886   GNB1    
chr1 2066155   2066655           
chr1 2149493   2149993   SKI     
chr1 2312573   2313353   MORN1   
                         RER1    
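
For comparison, the two-pass idea (collect every row first, then pad each column to its widest value) might look like this in Python. This is only a sketch: the function name format_fixed is invented here, and the rows are assumed to be already expanded into lists of four strings.

```python
def format_fixed(rows):
    """Left-justify every column to the width of its longest value."""
    # First pass: the widest value in each of the four columns.
    widths = [max((len(row[i]) for row in rows), default=0) for i in range(4)]
    # Second pass: pad each value and join the columns with a single space.
    return [" ".join(value.ljust(width) for value, width in zip(row, widths))
            for row in rows]

rows = [
    ["chr1", "38045559", "38046059", "C1orf122"],
    ["", "", "", "C1orf122"],
    ["", "", "", "YRDC"],
    ["chr1", "205291045", "205291545", "YOD1"],
]
for line in format_fixed(rows):
    print(line)
```

Because the widths are only known once all rows have been seen, nothing can be printed until the whole input has been read, which is the trade-off against the streaming version.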

Answer 2: (score: 0)

Does this Perl solution do what you want? You may need to tweak it:

perl -ane '
    @names = split /,/, $F[-1];
    print +(join "\t", @F[0 .. 2], join "\n\t\t\t", map +(split /_/)[3], @names), "\n";
'

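Update

-n tells Perl to process the input line by line.

-a tells Perl to split each line into the array @F.

The last field is split on , into a new array @names. Then the first three fields are printed, followed by the names joined with a newline and three tabs. Each name is obtained from @names by splitting the record on underscores and taking the fourth member.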
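
The same steps, written out as a Python sketch for readers who prefer it. The function name one_liner is made up, and skipping records that have fewer than four underscore-separated parts is an assumption of this sketch rather than something the Perl code is guaranteed to do.

```python
def one_liner(line):
    """Follow the steps described above for a single non-empty input line."""
    fields = line.split()                  # like -a: split the line on whitespace
    names = []
    for record in fields[-1].split(","):   # last field, split on commas
        parts = record.split("_")
        if len(parts) > 3:                 # assumption: skip records that are too short
            names.append(parts[3])         # fourth member after splitting on "_"
    # First three fields, then the names joined by a newline and three tabs.
    return "\t".join(fields[:3] + ["\n\t\t\t".join(names)])

line = ("chr1\t1699769\t1700657\t"
        "chr1:1699269-1700269_NM_023018_NADK_-,"
        "chr1:1699657-1700657_NM_001198993_NADK_-")
print(one_liner(line))
```

Splitting on underscores alone works because the colon only appears inside the first part, so the name is the fourth part here and the fifth field when colons are split as well.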

Using Text::Table to format the output:

perl -MText::Table -ane '
    BEGIN { $t = Text::Table->new }
    @names = split /,/, $F[-1];
    @n = map +(split /_/)[3], @names;
    my $f;
    $t->add($f++ ? (("") x 3)
                 : @F[0 .. 2], $_)
        for  @n ? @n : ("")
    }{
       print $t'