Duplicating rows by splitting the values in a specific column

Time: 2017-05-18 06:42:42

Tags: linux unix awk sed row

Hi all, I have a file like this:

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1    TFmatrixID_0009 taaaTTATTt  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0009 aAATAAatat  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0022 atcGGTAAct  Trihelix    AT5G28300
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  Homeodomain;bZIP;HD-ZIP AT3G61890

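I want to duplicate the whole row for each value in the TF_family column, which can hold several separate families delimited by ";". The output should look like this. Any help is appreciated: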

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1    TFmatrixID_0009 taaaTTATTt  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0009 aAATAAatat  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0022 atcGGTAAct  Trihelix    AT5G28300
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  Homeodomain AT3G61890
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  bZIP    AT3G61890
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  HD-ZIP  AT3G61890

1 answer:

Answer 0: (score: 1)

An awk approach:

awk 'NR==1{print}NR>1{split($4,a,";"); for(i=1;i<=length(a);i++){$4=a[i]; print $0}}' file

Output:

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1 TFmatrixID_0009 taaaTTATTt AT-Hook AT4G35390
CaCLV3_1 TFmatrixID_0009 aAATAAatat AT-Hook AT4G35390
CaCLV3_1 TFmatrixID_0022 atcGGTAAct Trihelix AT5G28300
CaCLV3_1 TFmatrixID_0025 tcAATCAatt Homeodomain AT3G61890
CaCLV3_1 TFmatrixID_0025 tcAATCAatt bZIP AT3G61890
CaCLV3_1 TFmatrixID_0025 tcAATCAatt HD-ZIP AT3G61890
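  • NR==1{print} - prints the first line (the header) as-is

  • split($4,a,";") - splits the 4th field on ";" into the array a

  • for(i=1;i<=length(a);i++){$4=a[i]; print $0} - prints the current record once for each sub-value. Assigning to $4 makes awk rebuild $0 with the output field separator (a single space by default), which is why the data rows in the output above are separated by single spaces while the header keeps its original spacing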
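The steps above can also be sketched in a form that keeps the original delimiter. This is only a sketch under one assumption that the question does not confirm: that the input is tab-separated. Setting FS and OFS to a tab means the rebuilt records come out tab-separated instead of space-separated. It also uses the return value of split as the element count rather than calling length on the array, which works in gawk and most current awk implementations but may not be available in every awk. The sample rows are copied from the question; the variable names input, result, and parts are illustrative.

```shell
# Sample data from the question, written with explicit tabs.
# Assumption: the real file is tab-separated.
input=$(
  printf 'Sequence ID\tTFBS_ID\tBinding sequence\tTF_family\tTF ID\n'
  printf 'CaCLV3_1\tTFmatrixID_0022\tatcGGTAAct\tTrihelix\tAT5G28300\n'
  printf 'CaCLV3_1\tTFmatrixID_0025\ttcAATCAatt\tHomeodomain;bZIP;HD-ZIP\tAT3G61890\n'
)

result=$(printf '%s\n' "$input" | awk '
BEGIN { FS = OFS = "\t" }        # read and write tab-separated fields
NR == 1 { print; next }          # header line: print unchanged
{
    n = split($4, parts, ";")    # n is the number of families in field 4
    for (i = 1; i <= n; i++) {
        $4 = parts[i]            # put a single family into field 4
        print                    # emit the rebuilt record
    }
}')

printf '%s\n' "$result"
```

To run it against a real file, the awk program can be given the file name as its argument instead of reading from the pipe. As in the original one-liner, a row whose 4th field is empty produces no output, because split returns 0 for an empty string.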