Basically, I need a script that solves this problem in a very short time. I have two files:
$ head -n 6 fcu.tsv
NM576455 0.324009324 0.578896174 2577
NM539570 0.204545455 0.607877092 2247
NM337132 0.288973384 0.673636364 792
NM374379 0.308300395 0.42 762
NM373443 0.263043478 0.547132867 1383
NM371839 0.298210736 0.492857143 1512
$ head -n 6 mart.tsv
NM539570 ILMN_2199362 15 58.52 protein_coding
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_2
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding molecular_function SAM_2
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component SAM_2
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding molecular_function SAM_2
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_type1
For each NM ID, fields 2, 3 and 4 of fcu.tsv need to be appended to the matching lines of mart.tsv, and this has to run quickly.
$ head out.tsv
NM539570 ILMN_2199362 15 58.52 protein_coding 0.204545455 0.607877092 2247
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_type1 0.324009324 0.578896174 2577
This is what I did in MATLAB (I would prefer a solution that fixes the faulty code here and makes it faster, rather than writing new code):
fr1 = fopen('fcu.tsv', 'r');
fr2 = fopen('mart.tsv', 'r');
fw = fopen('out.tsv', 'w');
while feof(fr1) == 0
    line = fgetl(fr1);
    scan = textscan(line, '%s%f%f%d');
    frewind(fr2);  % mart.tsv is rescanned from the top for every fcu.tsv line
    while feof(fr2) == 0
        line2 = fgetl(fr2);
        scan2 = textscan(line2, '%s%s%s%f%s%s%s%s');
        % strcmp compares whole strings; == on char arrays errors if the lengths differ
        if strcmp(scan{1}{1}, scan2{1}{1})
            fprintf(fw, '%s\t%f\t%f\t%d\n', line2, scan{2}, scan{3}, scan{4});
        end
    end
end
fclose(fr1);
fclose(fr2);
fclose(fw);
Thanks for the help.
Answer 0: (score: 2)
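One way using awk. In the FNR == NR case, it reads the first input file given as an argument (fcu.tsv) and saves it in a hash, with the first field as the key and the remaining fields, joined with \t, as the value. In the FNR < NR case, it reads mart.tsv and, if the first field matches a key in the hash, appends the stored value to the end of the line; otherwise it prints the original line.
Contents of script.awk: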
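BEGIN {
    OFS = "\t"
}
# First file (fcu.tsv): store fields 2..NF, tab-joined, keyed by the NM ID.
FNR == NR {
    for ( i = 2; i <= NF; i++ ) {
        line = (line ? line OFS : "") $i
    }
    fcu[ $1 ] = line
    line = ""
    next
}
# Second file (mart.tsv): append the stored values when the ID is known.
FNR < NR {
    if ( $1 in fcu ) {
        print $0 OFS fcu[ $1 ]
    }
    else {
        print $0
    }
}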
Run it like this:
awk -f script.awk fcu.tsv mart.tsv
With the following output:
NM539570 ILMN_2199362 15 58.52 protein_coding 0.204545455 0.607877092 2247
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component SAM_type1 0.324009324 0.578896174 2577
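For reference, the same lookup also fits in a short shell function, which avoids the separate script file. This is only a sketch: the function name append_fcu is made up for illustration, and it assumes both files are tab-separated and that fcu.tsv is not empty (the NR == FNR test misfires on an empty first file). Each file is read once, whereas the MATLAB loop in the question rewinds and rescans mart.tsv for every line of fcu.tsv.

```shell
# Hypothetical helper (name chosen for illustration): append_fcu FCU_FILE MART_FILE
# Assumes tab-separated input; prints the merged lines to stdout.
append_fcu() {
    awk -F '\t' -v OFS='\t' '
        # First file: remember everything after the ID, keyed by the ID.
        NR == FNR {
            key = $1
            sub(/^[^\t]*\t/, "")   # drop the ID and its trailing tab from $0
            fcu[key] = $0
            next
        }
        # Second file: append the remembered values when the ID is known.
        {
            if ($1 in fcu)
                print $0 OFS fcu[$1]
            else
                print $0
        }
    ' "$1" "$2"
}

# Usage (file names as in the question):
#   append_fcu fcu.tsv mart.tsv > out.tsv
```

Lines of mart.tsv whose ID is missing from fcu.tsv are printed unchanged, as in script.awk above.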
Answer 1: (score: 0)
This is a command-line-centric solution that works on any system with coreutils; apologies if it does not apply to your situation.
If mart.tsv is properly padded, like this:
NM539570 ILMN_2199362 15 58.52 protein_coding NA NA NA NA
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA SAM_2
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding molecular_function SAM_2
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component NA SAM_2
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding molecular_function SAM_2
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA SAM_type1
then the solution can be as simple as join (see info join):
$ join <(sort mart.tsv) <(sort fcu.tsv) | column -t
NM539570 ILMN_2199362 15 58.52 protein_coding NA NA NA NA 0.204545455 0.607877092 2247
NM576455 ILMN_1709067 1 65.74 protein_coding nucleus cellular_component NA SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_1709067 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA SAM_2 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding nucleus cellular_component NA SAM_type1 0.324009324 0.578896174 2577
NM576455 ILMN_2195138 1 65.74 protein_coding protein binding molecular_function SAM_2 0.324009324 0.578896174 2577
column comes from the bsdmainutils package.
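If the files really are tab-separated, join can be told so with -t, which keeps a value such as "protein binding" as a single field, and -a 1 keeps mart.tsv lines that have no match in fcu.tsv. A rough sketch under those assumptions: the function name join_tsv is hypothetical, and LC_ALL=C is set so that sort and join agree on the ordering. Note that the output comes out sorted by ID rather than in mart.tsv's original order.

```shell
# Hypothetical helper (name chosen for illustration): join_tsv MART_FILE FCU_FILE
# Assumes tab-separated input with the ID in the first field of both files.
join_tsv() {
    tab=$(printf '\t')
    tmp=$(mktemp -d) || return 1
    # join needs both inputs sorted on the join field, in the same collation.
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$tmp/left"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmp/right"
    # -t: use a tab as the field separator; -a 1: also print unmatched left lines.
    LC_ALL=C join -t "$tab" -a 1 "$tmp/left" "$tmp/right"
    rm -rf "$tmp"
}

# Usage (file names as in the question):
#   join_tsv mart.tsv fcu.tsv > out.tsv
```

Sorting into temporary files instead of using <(...) keeps the function usable in a plain POSIX shell.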