Append fields from one file to another file based on ID

Asked: 2012-07-15 09:42:14

Tags: matlab awk

Basically, I need a script that solves this problem quickly. I have two files:

$ head -n 6 fcu.tsv

NM576455     0.324009324     0.578896174     2577
NM539570     0.204545455     0.607877092     2247
NM337132     0.288973384     0.673636364     792
NM374379     0.308300395     0.42            762
NM373443     0.263043478     0.547132867     1383
NM371839     0.298210736     0.492857143     1512

$ head -n 6 mart.tsv

NM539570 ILMN_2199362    15      58.52   protein_coding
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1

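For each NM ID, I need to append the 2nd, 3rd and 4th fields of fcu.tsv to the matching lines of mart.tsv, and it has to run in a short time. The expected result: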

$ head out.tsv

NM539570 ILMN_2199362    15      58.52   protein_coding  0.204545455     0.607877092     2247
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2 0.324009324   0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1 0.324009324   0.578896174     2577

This is what I did in MATLAB (I would prefer a solution that fixes the faulty code below and makes it faster, rather than a completely new program):

fr1 = fopen('fcu.tsv', 'r');
fr2 = fopen('mart.tsv', 'r');

fw = fopen('out.tsv', 'w');

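while feof(fr1) == 0
    % read one record from fcu.tsv: ID, two floats, one integer
    line = fgetl(fr1);
    scan = textscan(line, '%s%f%f%d');

    % rescan mart.tsv from the start for every fcu.tsv record
    frewind(fr2);

    while feof(fr2) == 0
        line2 = fgetl(fr2);
        scan2 = textscan(line2, '%s%s%s%f%s%s%s%s');

        % append the fcu.tsv values when the IDs match
        if scan{1}{1} == scan2{1}{1}
            fprintf(fw, '%s\t%f\t%f\t%d\n', line2, scan{2}, scan{3}, scan{4});
        end
    end
end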

Thanks for the help.

2 answers:

Answer 0 (score: 2)

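One way using awk. While FNR == NR it is reading the first input file given on the command line (fcu.tsv); each line is stored in a hash, with the first field as the key and the remaining fields, joined with \t, as the value. While FNR < NR it is reading mart.tsv; if the first field matches a key in the hash, the stored value is appended to the end of the line, otherwise the original line is printed unchanged.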

Contents of script.awk:

BEGIN {
    OFS = "\t"
}

FNR == NR {
    for ( i = 2; i <= NF; i++ ) { 
        line = (line ? line OFS : "") $i
    }   
    fcu[ $1 ] = line 
    line = ""
    next
}

FNR < NR {
    if ( $1 in fcu ) { 
        print $0 OFS fcu[ $1 ]
    }   
    else {
        print $0
    }   
}

Run it like this:

awk -f script.awk fcu.tsv mart.tsv

With the following output:

NM539570 ILMN_2199362    15      58.52   protein_coding 0.204545455     0.607877092     2247
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding molecular_function      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding molecular_function      SAM_2  0.324009324     0.578896174     2577
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component      SAM_type1      0.324009324     0.578896174     2577
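
For reference, the same two-file idiom can be packed into a small shell function. This is only a sketch under assumptions: append_fields is a made-up name, and unlike script.awk it assumes that both files are strictly tab-separated, so that a value such as "protein binding" stays a single field. Instead of looping over the fields, it strips the ID from each fcu.tsv line and stores whatever is left.

```shell
# append_fields LOOKUP TARGET   (hypothetical helper name)
# Appends everything after the ID in LOOKUP to each TARGET line with the same ID.
# Assumption: both files are tab-separated and the ID is in column 1.
append_fields() {
    awk -F '\t' '
        NR == FNR {                 # first file: remember the rest of the line per ID
            id = $1
            sub(/^[^\t]*\t/, "")    # drop the ID and its trailing tab from $0
            rest[id] = $0
            next
        }
        $1 in rest { print $0 "\t" rest[$1]; next }   # second file, known ID
        { print }                                     # second file, unknown ID
    ' "$1" "$2"
}

# usage: append_fields fcu.tsv mart.tsv > out.tsv
```

The lookup file is read once and the target file is streamed once, so the run time grows with the sum of the two file sizes rather than with their product, which is what the nested loops in the question would cost.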

Answer 1 (score: 0)

This is a command-line-centric solution that works on any system with coreutils; apologies if it does not apply to your situation.

If mart.tsv is properly padded, like this:

NM539570 ILMN_2199362    15      58.52   protein_coding  NA      NA                 NA                      NA
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  protein binding            molecular_function      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_2
NM576455 ILMN_1709067    1       65.74   protein_coding  protein binding            molecular_function      SAM_2
NM576455 ILMN_2195138    1       65.74   protein_coding  nucleus cellular_component NA                      SAM_type1

The solution can then be as simple as join (see info join):

$ join <(sort mart.tsv) <(sort fcu.tsv) | column -t
NM539570  ILMN_2199362  15  58.52  protein_coding  NA       NA                  NA                  NA         0.204545455  0.607877092  2247
NM576455  ILMN_1709067  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_1709067  1   65.74  protein_coding  protein  binding             molecular_function  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_2      0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  nucleus  cellular_component  NA                  SAM_type1  0.324009324  0.578896174  2577
NM576455  ILMN_2195138  1   65.74  protein_coding  protein  binding             molecular_function  SAM_2      0.324009324  0.578896174  2577

column comes from the bsdmainutils package.
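
If the files are tab-separated rather than padded, a variant along these lines might avoid the NA padding altogether. It is a sketch under assumptions: join_by_id is a made-up name, the ID is taken to be the first tab-separated column, and plain POSIX sh with temporary files is used in place of the <( ) process substitution shown above. The -t option makes join split on tabs only, and -a 1 keeps mart.tsv lines that have no partner in fcu.tsv, which join would otherwise drop.

```shell
# join_by_id TARGET LOOKUP   (hypothetical helper name)
# Assumption: both files are tab-separated and the ID is in column 1.
join_by_id() {
    tab=$(printf '\t')
    dir=$(mktemp -d) || return 1
    # join expects both inputs to be sorted on the join field
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$dir/a"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$dir/b"
    # -t: split on tabs only; -a 1: also print TARGET lines without a match
    LC_ALL=C join -t "$tab" -a 1 "$dir/a" "$dir/b"
    rm -rf "$dir"
}

# usage: join_by_id mart.tsv fcu.tsv > out.tsv
```

Note that the result comes out sorted by ID, not in the original order of mart.tsv, so this only fits when the line order does not matter.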