Combining two different text files on Linux

Time: 2014-08-26 21:31:39

Tags: python linux perl awk sed

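I would like to use awk, sed, or another tool to do the following.

  1. Compare two files (File1, File2) by ID.
  2. Where the IDs match, bring the corresponding data from File2 into File1.
  3. For example, as shown below: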

    First file name: File1.txt
    Contents (tab-delimited table):

    ID      Match     Length
    100      OK        1000
    200      OK        1000
    300      OK        2000
    400      OK        2000
    500      OK        3000
    

    Second file name: File2.fasta
    It contains the following:

    >100
    ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
    >200
    CTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGA
    >300
    TGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAC
    >400
    GACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACT
    >500
    ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
    

    So I want to add one more column to File1.txt, taken from File2.fasta. This is the desired final result:

    ID      Match     Length     Sequence
    100      OK        1000     ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
    200      OK        1000     CTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGA
    300      OK        2000     TGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAC
    400      OK        2000     GACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACT
    500      OK        3000     ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
    

    Does anyone have a good idea for how to combine these two files?

2 Answers:

Answer 0: (score: 2)

I believe you are looking for join.

First, you need to sort both files and bring them into a common format (same delimiter).

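# Flatten each FASTA record into one tab-separated line, then sort by ID.
# Relies on GNU sed understanding \t and \n in the replacement text.
sed 's/$/\t/' File2.fasta | tr -d '\n' | sed 's/>/\n/g' | sort > File2.fasta.sorted
sort File1.txt > File1.txt.sorted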
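
The flattening step above can also be sketched as a single awk pass. This is not the answer's original method, only a minimal sketch: it assumes every header line is `>` followed directly by the ID, and the function name fasta_to_tsv is made up for illustration. Unlike the sed/tr pipeline, it emits no trailing tab, and it also joins sequences that are wrapped over several lines.

```shell
# Hypothetical helper: flatten FASTA records into "ID<TAB>sequence" lines.
# Assumes each header line is ">" followed directly by the ID.
fasta_to_tsv() {
    awk '
        /^>/ {                       # a header line starts a new record
            if (id != "") print id "\t" seq
            id = substr($0, 2)       # drop the leading ">"
            seq = ""
            next
        }
        { seq = seq $0 }             # append sequence lines (may be wrapped)
        END { if (id != "") print id "\t" seq }
    ' "$1"
}

# Usage with the file names from the question:
# fasta_to_tsv File2.fasta | sort > File2.fasta.sorted
```

The result has the same ID-first, tab-delimited layout that join expects for its second input.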

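Then you just need to join them like this:

TAB=$(printf '\t')
join -a1 -t "$TAB" File1.txt.sorted File2.fasta.sorted

Note that $TAB holds a tab character. It must be expanded inside double quotes; in single quotes, join would receive the literal text $TAB as its delimiter.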

This produces a result like this:

100 OK  1000    ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG    
200 OK  1000    CTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGA    
300 OK  2000    TGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAC    
400 OK  2000    GACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACT    
500 OK  3000    ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG    
ID  Match   Length

This is what you want, apart from the column names and the position of the header row.
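
If the header row and the original row order matter, one alternative to sort plus join is a two-file awk lookup. The following is only a sketch under assumptions the question does not fully confirm: File1.txt is tab-delimited with the ID in the first column, each FASTA header is `>` followed directly by the ID, and the FASTA file is not empty. The function name add_sequence_column is hypothetical.

```shell
# Hypothetical helper: append a Sequence column to a tab-delimited table.
# $1 = FASTA file, $2 = table whose first column is the ID.
add_sequence_column() {
    awk -F '\t' -v OFS='\t' '
        # First file (FASTA): build a map from ID to sequence.
        NR == FNR && /^>/ { id = substr($0, 2); next }
        NR == FNR         { seq[id] = seq[id] $0; next }
        # Second file (table): name the new column on the header row.
        FNR == 1          { print $0, "Sequence"; next }
        # Data rows: the new column stays empty if the ID is unknown.
                          { print $0, seq[$1] }
    ' "$1" "$2"
}

# Usage with the file names from the question:
# add_sequence_column File2.fasta File1.txt > File1.with_sequence.txt
```

Because the table is read in its original order, no sorting is needed and the header stays on the first line.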

Answer 1: (score: 0)

IFS=$(echo -en "\n\b") && i=1 && for a in $(cat File1.txt); do ((i)) && echo "$a Sequence" && i=0 || echo "$a $(sed -n "/^>$(echo "$a" | awk '{print $1}')\$/{n;p}" File2.fasta)"; done && unset IFS

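This loops over File1.txt. For the first line only, it prints the header with the new column name. For every later line, it uses sed to find the line that follows the matching ID in File2.fasta and echoes it as a new column.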
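
The same idea can be written out over several lines, which may be easier to follow. This is a sketch rather than the answer's exact code: the function name append_sequences_loop is hypothetical, it separates the new column with a tab where the one-liner uses a space, and it uses awk for the lookup so that only a header line equal to `>` plus the ID counts as a match.

```shell
# Hypothetical helper: the per-line lookup loop, written out step by step.
# $1 = table (header on the first line, ID in the first column), $2 = FASTA file.
append_sequences_loop() {
    first=1
    while IFS= read -r line; do
        if [ "$first" -eq 1 ]; then
            printf '%s\tSequence\n' "$line"   # header row gets the new column name
            first=0
            continue
        fi
        # The first whitespace-separated field of the row is the ID.
        id=$(printf '%s\n' "$line" | awk '{ print $1 }')
        # Print the line that follows the exact header ">ID", then stop.
        seq=$(awk -v want=">$id" 'found { print; exit } $0 == want { found = 1 }' "$2")
        printf '%s\t%s\n' "$line" "$seq"
    done < "$1"
}

# Usage with the file names from the question:
# append_sequences_loop File1.txt File2.fasta
```

Like the one-liner, this scans the FASTA file once per table row, so it is likely to be slow on large inputs.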