I have a data file (tab-delimited) that looks like this:
chr1 38045559 38046059 chr1:38045559-38046559_NM_001142726_C1orf122_+,chr1:38045559-38046559_NM_198446_C1orf122_+,chr1:38045952-38046952_NM_024640_YRDC_-
chr1 205291045 205291545 chr1:205290545-205291545_NM_018566_YOD1_-
chr1 1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1 1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1 1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_-,chr1:1699657-1700657_NM_001198993_NADK_-
chr1 1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1 1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1 2066155 2066655
chr1 2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1 2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_-,chr1:2312573-2313573_NM_007033_RER1_+
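What I would like is to keep the first three columns as they are and to extract the fifth field from each annotation record. For example, for row 1 the output should look like this: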
chr1 38045559 38046059 C1orf122
                       C1orf122
                       YRDC
What I have done so far is split the last column of my data on the comma "," using
tr ',' '\t' <input >temp1
What I have now is a file that looks like this:
chr1 38045559 38046059 chr1:38045559-38046559_NM_001142726_C1orf122_+ chr1:38045559-38046559_NM_198446_C1orf122_+ chr1:38045952-38046952_NM_024640_YRDC_-
chr1 205291045 205291545 chr1:205290545-205291545_NM_018566_YOD1_-
chr1 1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1 1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1 1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_- chr1:1699657-1700657_NM_001198993_NADK_-
chr1 1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1 1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1 2066155 2066655
chr1 2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1 2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_- chr1:2312573-2313573_NM_007033_RER1_+
Now I need some of your expertise to get to the desired output format.
Please guide me on how to get the desired output in python / perl / shell.
Answer 0 (score: 1)
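I would consider using a script file, since this task is a prime candidate for tweaking and reuse and is at least moderately complex. A script also makes it easier to use suitable modules in your code. Text::CSV will safely read your csv file, and Text::ParseWords will handle your nested fields.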
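The script below is for demonstration. You can change the file handle *DATA to *ARGV to make the script parse a file given as an argument instead, for example:
perl script.pl file.csv > output.txt
Code: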
use strict;
use warnings;
use Text::CSV;
use Text::ParseWords;
my $csv = Text::CSV->new({                       # create csv object
    sep_char => "\t",                            # delimiter is tab
});
while (my $row = $csv->getline(*DATA)) {         # read from file handle
    my @anno = quotewords(',', 0, $row->[-1]);   # get list of fields
    @anno = "" unless @anno;                     # avoid empty list
    for (@anno) {                                # for each field
        my @inner = quotewords('[:_]', 0, $_);   # get inner fields
        my $anno = $inner[-2] // "";             # take second last
        print join "\t", @$row[0 .. 2], $anno;
        print $/;
        $_ = "" for @$row;                       # clear primary row once printed
    }
}
__DATA__
chr1 38045559 38046059 chr1:38045559-38046559_NM_001142726_C1orf122_+,chr1:38045559-38046559_NM_198446_C1orf122_+,chr1:38045952-38046952_NM_024640_YRDC_-
chr1 205291045 205291545 chr1:205290545-205291545_NM_018566_YOD1_-
chr1 1499717 1500625 chr1:1499625-1500625_NM_014188_SSU72_-
chr1 1679941 1680441 chr1:1679441-1680441_NM_001198995_NADK_-
chr1 1699769 1700657 chr1:1699269-1700269_NM_023018_NADK_-,chr1:1699657-1700657_NM_001198993_NADK_-
chr1 1701368 1701868 chr1:1700868-1701868_NM_001198994_NADK_-
chr1 1812386 1812886 chr1:1811886-1812886_NM_002074_GNB1_-
chr1 2066155 2066655
chr1 2149493 2149993 chr1:2149493-2150493_NM_003036_SKI_+
chr1 2312573 2313353 chr1:2312353-2313353_NM_024848_MORN1_-,chr1:2312573-2313573_NM_007033_RER1_+
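Since the question also mentions Python, the same idea could be sketched with the standard csv module. This is only a rough equivalent, not a translation of the script above: the function name extract_genes and the two-record sample are made up for illustration, and it assumes the gene name is always the second-to-last underscore-separated piece of each annotation.

```python
import csv
import io

def extract_genes(lines):
    """Yield [chrom, start, end, gene] rows from tab-delimited records."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        coords = row[:3]
        # The annotation column may be missing entirely
        annos = row[3].split(",") if len(row) > 3 and row[3] else [""]
        for anno in annos:
            parts = anno.split("_")
            # Second-to-last piece is the gene name, e.g. ..._C1orf122_+
            gene = parts[-2] if len(parts) >= 2 else ""
            yield coords + [gene]
            coords = ["", "", ""]  # blank out the coordinates on continuation rows

# Hypothetical sample: one record with two annotations, one with none
sample = (
    "chr1\t38045559\t38046059\tchr1:38045559-38046559_NM_001142726_C1orf122_+,"
    "chr1:38045952-38046952_NM_024640_YRDC_-\n"
    "chr1\t2066155\t2066655\n"
)
rows = list(extract_genes(io.StringIO(sample)))
for r in rows:
    print("\t".join(r))
```

To read a real file instead of the sample, pass an open file object (opened with newline="") to extract_genes.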
Answer 1 (score: 1)
I suggest this solution, which expects the input file as a parameter on the command line
use strict;
use warnings;
while (<>) {
    chomp;
    my @columns = split "\t";
    print join "\t", @columns[0, 1, 2];   # first three columns as-is
    unless ($columns[3]) {                # no annotation column
        print "\n";
        next;
    }
    my @records = split /,/, $columns[3];
    my $tabs = 1;
    for (@records) {
        my @notes = split /[_:]/;         # the gene name is the fifth piece
        print "\t" x $tabs;
        print $notes[4], "\n";
        $tabs = 4;                        # continuation lines get four leading tabs
    }
}
Output
chr1    38045559        38046059        C1orf122
                                C1orf122
                                YRDC
chr1    205291045       205291545       YOD1
chr1    1499717 1500625 SSU72
chr1    1679941 1680441 NADK
chr1    1699769 1700657 NADK
                                NADK
chr1    1701368 1701868 NADK
chr1    1812386 1812886 GNB1
chr1    2066155 2066655
chr1    2149493 2149993 SKI
chr1    2312573 2313353 MORN1
                                RER1
Note that the misalignment is because variable-length fields are separated by tabs
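If Python is preferred, the same line-by-line idea might look roughly like the sketch below. The function name format_record is hypothetical, and the sketch assumes every annotation record has at least five pieces when split on "_" and ":". Unlike the Perl above, it indents continuation lines with three tabs, so the gene name stays in the fourth tab-separated field.

```python
import re

def format_record(line):
    """Return the output text for one tab-delimited input line."""
    columns = line.rstrip("\n").split("\t")
    head = "\t".join(columns[:3])
    if len(columns) < 4 or not columns[3]:
        return head  # no annotation column: coordinates only
    # Split each record on "_" or ":" and keep the fifth piece (index 4)
    names = [re.split(r"[_:]", rec)[4] for rec in columns[3].split(",")]
    out = [head + "\t" + names[0]]
    out += ["\t" * 3 + name for name in names[1:]]
    return "\n".join(out)

line = ("chr1\t1699769\t1700657\t"
        "chr1:1699269-1700269_NM_023018_NADK_-,"
        "chr1:1699657-1700657_NM_001198993_NADK_-\n")
print(format_record(line))
```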
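Update
This version accumulates the output in an array and works out the maximum width of each column, so that the output can be displayed with appropriate fixed field widths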
use strict;
use warnings;
my @output;
while (<>) {
    chomp;
    my @columns = split "\t";
    my @outrec = @columns[0,1,2];
    if ($columns[3]) {
        my @records = split /,/, $columns[3];
        for (@records) {
            my @notes = split /[_:]/;
            $outrec[3] = $notes[4];
            push @output, [ @outrec ];
            @outrec = ();   # continuation rows carry only the gene name
        }
    }
    else {
        push @output, \@outrec;
    }
}
# find the widest entry in each of the four columns
my @sizes;
for (@output) {
    for my $i (0..3) {
        my $length = length($_->[$i] // '');
        $sizes[$i] = $length unless $sizes[$i] and $sizes[$i] > $length;
    }
}
# print every row padded to those widths
for my $outrec (@output) {
    printf "%-*s %-*s %-*s %-*s\n", map { $sizes[$_], $outrec->[$_] // ''} 0..3;
}
Output
chr1 38045559  38046059  C1orf122
                         C1orf122
                         YRDC
chr1 205291045 205291545 YOD1
chr1 1499717   1500625   SSU72
chr1 1679941   1680441   NADK
chr1 1699769   1700657   NADK
                         NADK
chr1 1701368   1701868   NADK
chr1 1812386   1812886   GNB1
chr1 2066155   2066655
chr1 2149493   2149993   SKI
chr1 2312573   2313353   MORN1
                         RER1
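The width calculation is independent of Perl. As a minimal sketch in Python, assuming the rows have already been extracted into lists of strings (the function name align and the sample rows are made up for illustration):

```python
def align(rows):
    """Pad every column to the width of its longest entry."""
    ncols = max(len(r) for r in rows)
    # Give short rows empty trailing cells so all rows have the same length
    padded = [list(r) + [""] * (ncols - len(r)) for r in rows]
    widths = [max(len(r[i]) for r in padded) for i in range(ncols)]
    return [
        " ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in padded
    ]

rows = [
    ["chr1", "38045559", "38046059", "C1orf122"],
    ["", "", "", "YRDC"],
    ["chr1", "205291045", "205291545", "YOD1"],
    ["chr1", "2066155", "2066655"],
]
for line in align(rows):
    print(line)
```

One difference from the printf version: this sketch strips trailing spaces from each line, whereas printf pads the last column as well.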
Answer 2 (score: 0)
Does this Perl solution do what you want? You may need to tweak it:
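perl -ane '
    @names = split /,/, $F[-1];
    print +(join "\t", @F[0 .. 2], join "\n\t\t\t", map +(split /_/)[3], @names), "\n";
'
Update
-n tells Perl to process the input line by line. -a tells Perl to split each line on whitespace into the array @F.
The last field is split on , into the new array @names. Then the first three fields are printed, followed by the names joined with a newline and three tabs. Each name is obtained from an element of @names by splitting it on underscores and taking the fourth member.
The following variant uses Text::Table to align the output columns: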
perl -MText::Table -ane '
    BEGIN { $t = Text::Table->new }
    @names = split /,/, $F[-1];
    @n = map +(split /_/)[3], @names;
    my $f;
    $t->add($f++ ? (("") x 3)
                 : @F[0 .. 2], $_)
        for @n ? @n : ("")
}{
    print $t'
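For readers more comfortable with Python, the splitting logic of the first one-liner (split the last field on commas, split each record on underscores, take the fourth piece, join with a newline and three tabs) can be sketched as follows. The function name one_line is hypothetical. Records with fewer than three underscores are skipped here, which is an assumption of this sketch and not something the Perl does explicitly.

```python
def one_line(line):
    """Mimic the -ane one-liner for a single input line."""
    fields = line.split()  # like perl -a: split on whitespace
    names = fields[-1].split(",")
    # Fourth underscore-separated piece of each record is the gene name
    genes = [n.split("_")[3] for n in names if n.count("_") >= 3]
    return "\t".join(fields[:3] + ["\n\t\t\t".join(genes)])

record = ("chr1 2312573 2313353 "
          "chr1:2312353-2313353_NM_024848_MORN1_-,"
          "chr1:2312573-2313573_NM_007033_RER1_+")
print(one_line(record))
```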