Joining selected columns from multiple TSV files in bash

Date: 2015-11-13 23:48:07

Tags: bash csv awk sed cut

I have a bunch of tab-delimited text files that look like this:

"gene_id"   "Pattern1"  "Pattern2"  "Pattern3"  "Pattern4"  "Pattern5"  "MAP"   "PPDE"
"ENSG00000119771.13"    3.11528786599051e-18    2.52650109640992e-13    6.25109524320237e-09    0.345846257420197   0.654153736328455   "Pattern5"  1
"ENSG00000123700.4" 1.75016991626305e-36    3.98804090894939e-19    0.63423772228367    3.8159144080782e-21 0.36576227771633    "Pattern3"  1
"ENSG00000128567.15"    1.10722918612618e-23    7.62691311068806e-07    5.77031364194955e-06    5.13675840911147e-21    0.999993466995047   "Pattern5"  1
"ENSG00000130182.6" 9.75717082221716e-22    1.27675651077242e-12    0.469972541094369   1.13677117238758e-12    0.530027458903217   "Pattern5"  1
"ENSG00000131914.9" 3.1627489688037e-41 1.00274706758683e-22    0.0578584524816503  6.98718794692175e-22    0.94214154751835    "Pattern5"  1

Now I want to join them into a single file so that I get

"gene_id"   "Pattern5"  "Pattern5"  "Pattern5"  "Pattern5"  "Pattern5"  

where each Pattern5 column comes from a different file.

I tried a few things with
cut -f 6 <file>

paste <file1> <file2> ...

but I could not get them to combine correctly.

Thanks for your help!

UPDATE: Here is a testable example to use as input:

<file1>
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 1   2   3   4   5
ENSG00000123700 1   2   3   4   5
ENSG00000128567 1   2   3   4   5
ENSG00000130182 1   2   3   4   5
ENSG00000131914 1   2   3   4   5

<file2>         
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 6   7   8   9   10
ENSG00000123700 6   7   8   9   10
ENSG00000128567 6   7   8   9   10
ENSG00000130182 6   7   8   9   10
ENSG00000131914 6   7   8   9   10

<file3>             
gene_id Pattern1    Pattern2    Pattern3    Pattern4    Pattern5
ENSG00000119771 11  12  13  14  15
ENSG00000123700 11  12  13  14  15
ENSG00000128567 11  12  13  14  15
ENSG00000130182 11  12  13  14  15
ENSG00000131914 11  12  13  14  15

The desired output would be:

gene_id Pattern5_file1  Pattern5_file2  Pattern5_file3
ENSG00000119771 5   10  15
ENSG00000123700 5   10  15
ENSG00000128567 5   10  15
ENSG00000130182 5   10  15
ENSG00000131914 5   10  15

UPDATE2: I tried Ed Morton's approach:

awk '
BEGIN { FS=OFS="\t" } FNR==1{ARGIND++}
{ genes[$1]; val[$1,ARGIND] = $5 }
END {
    for (gene in genes) {
        printf "%s%s", gene, OFS
        for (file=1; file<=ARGIND; file++) {
            printf "%s%s", val[gene,file], (file<ARGIND?OFS:ORS)
        }
    }
} ' $files

but the output is not in the right format:

ENSG00000128567 4   9   14
ENSG00000130182 4   9   14
ENSG00000119771 4   9   14
gene_id Pattern4    Pattern4    Pattern4
ENSG00000131914 4   9   14
ENSG00000123700 4   9   14
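
Two things appear to explain this output: the script reads `$5`, which is Pattern4 in the sample files (Pattern5 is `$6`), and `for (gene in genes)` visits array keys in an unspecified order, so the header row can end up in the middle. The following is a minimal sketch of one way to address both, assuming the key is in column 1 of every file. The function name `join_by_gene` and the variables `nfiles`, `seen` and `order` are illustrative and not part of the original script.

```bash
#!/bin/bash
# join_by_gene COL FILE... : hypothetical helper, not from the original post.
# Prints column 1 plus column COL of every FILE, one output column per file,
# keeping the row order in which keys are first seen.
join_by_gene() {
    local col=$1
    shift
    awk -v col="$col" '
    BEGIN { FS = OFS = "\t" }
    FNR == 1 { nfiles++ }                          # a new input file starts
    !($1 in seen) { seen[$1]; order[++n] = $1 }    # remember first-seen key order
    {
        # On each header line, tag the column name with the file index.
        val[$1, nfiles] = (FNR == 1) ? ($col "_file" nfiles) : $col
    }
    END {
        for (i = 1; i <= n; i++) {
            key = order[i]
            printf "%s", key
            for (f = 1; f <= nfiles; f++)
                printf "%s%s", OFS, val[key, f]
            printf "%s", ORS
        }
    }' "$@"
}
```

It could be called as `join_by_gene 6 file1 file2 file3`. Counting files in a separate variable avoids touching `ARGIND`, which GNU awk maintains on its own.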

2 Answers:

Answer 0 (score: 2)

Try this:

#!/bin/bash

# Assumes every file has exactly patternIdx columns, so that Pattern5
# is every patternIdx-th column of the pasted line.
paste file1 file2 file3 | awk -F'\t' -v OFS='\t' -v patternIdx=6 '

function printPattern(idx, isFirstLine,    i) {
    printf "%s", $1
    for (i = idx; i <= NF; i += idx) {
        if (isFirstLine)
            printf "%s%s_file%d", OFS, $i, i / idx
        else
            printf "%s%s", OFS, $i    # %s keeps non-integer values intact
    }
    printf "\n"
}
{ printPattern(patternIdx, NR == 1) }'

patternIdx is the column index of Pattern5.
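
The script picks every `patternIdx`-th column of the pasted line, which lines up with Pattern5 only when each file has exactly `patternIdx` columns, as in the six-column test files. The original data has eight columns (MAP and PPDE follow Pattern5), so there the stride would drift onto other columns. Below is a sketch of a variant that takes the per-file column count separately, assuming all files have the same number of columns and list the rows in the same order. The names `pick_pasted`, `ncols` and `col` are hypothetical.

```bash
#!/bin/bash
# pick_pasted NCOLS COL FILE... : hypothetical helper, not from the answer above.
# NCOLS is the number of columns in each file, COL the column to keep.
pick_pasted() {
    local ncols=$1 col=$2
    shift 2
    paste "$@" | awk -F'\t' -v OFS='\t' -v ncols="$ncols" -v col="$col" '
    {
        printf "%s", $1                         # key column from the first file
        nfile = 0
        for (i = col; i <= NF; i += ncols) {    # same column in each file block
            nfile++
            # Only the header line gets the _fileN suffix.
            printf "%s%s%s", OFS, $i, (NR == 1 ? ("_file" nfile) : "")
        }
        printf "\n"
    }'
}
```

For the eight-column files in the question this would be called as `pick_pasted 8 6 file1 file2 file3`.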

Answer 1 (score: 1)

for f in file1 file2 file3; do
    cut -f 6 "$f"
done |
awk '{ if ($1 ~ /Pattern5/) printf("\n%s\t", $1); else printf("%s\t", $1) } END { print "" }' |
tail -n +2

"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047
"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047
"Pattern5" 0.654153736328455 0.36576227771633 0.999993466995047

(I just used the same data for file1-3.) You can also specify the input files with a glob, if they are named consistently, e.g.: for f in myfiles*
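
The loop above prints one row per file and leaves out the gene_id column. If one column per file is wanted, as in the question's desired output, a minimal sketch along the same cut-based lines could look like the following. It assumes all files list the genes in the same order; `join_column` and the temporary file names are hypothetical.

```bash
#!/bin/bash
# join_column COL FILE... : hypothetical helper, not from the answer above.
# Writes column 1 of the first file followed by column COL of every file.
join_column() {
    local col=$1 tmp f tag
    shift
    tmp=$(mktemp -d) || return 1
    cut -f1 "$1" > "$tmp/out"                  # key column from the first file
    for f in "$@"; do
        tag=$(basename "$f")
        # Take the chosen column and suffix its header with the file name.
        cut -f"$col" "$f" |
            awk -v tag="$tag" 'NR == 1 { $0 = $0 "_" tag } { print }' > "$tmp/col"
        paste "$tmp/out" "$tmp/col" > "$tmp/next"
        mv "$tmp/next" "$tmp/out"
    done
    cat "$tmp/out"
    rm -rf "$tmp"
}
```

With a glob this might be run as `join_column 6 myfiles* > joined.tsv`. Because the columns are matched by position rather than by gene_id, files with differing row order would need a keyed join instead.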