Question

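I have a tab-delimited input file in which some of the second-column values contain spaces, so they end up split across columns as if the space were a delimiter. For example, "LEA type" belongs entirely in the second column, but it is split so that "LEA" lands in the second column and "type" in the third. Similarly, "Ribosomal protein L21P" is a single name that should sit in the second column, but it is split across the second, third and fourth columns.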

1st_col     2nd_col     3rd_col    4th_col  5th_col 6th_col
tATAAAta    TBP         ~           1       
tACCAT      Ribosomal   protein     L21P    ~   2
agtACCAT    Ribosomal   protein     L21P    ~   2
ATGTActt    AP2         ~           1       
GCAACggagc  LEA         type        1       ~   1
ATGGTa      Ribosomal   protein     L21P    ~   1
ATGGTctt    Ribosomal   protein     L21P    ~   2
ATGGTaca    Ribosomal   protein     L21P    ~   1

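The desired output should look like this: "LEA type" should be in the second column as "LEA_type", and the position and content of the other cells should not shift.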

1st_col     2nd_col                 3rd_col 4th_col 5th_col 6th_col
tATAAAta    TBP                     ~       1
tACCAT      Ribosomal_protein_L21P  ~       2
agtACCAT    Ribosomal_protein_L21P  ~       2
ATGTActt    AP2                     ~       1
GCAACggagc  LEA_type                ~       1
ATGGTa      Ribosomal_protein_L21P  ~       1
ATGGTctt    Ribosomal_protein_L21P  ~       2
ATGGTaca    Ribosomal_protein_L21P  ~       1

I tried something like this, but it also shifts the other cells.

 sed 's/LEA\stype/LEA_type/g' 1_com_final_2922.txt | sed 's/Ribosomal\sprotein/Ribosomal_protein/g'

Thanks in advance.

Answer 1

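Your question is not 100% clear, but based on the output you have shown and the conditions you explained, the following looks for the strings LEA, type and Ribosomal, protein, L21P and combines them as in your shown output.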

awk '($2=="LEA" && $3=="type"){$2="LEA_type";$3=""} ($2=="Ribosomal" && $3=="protein" && $4=="L21P"){$2="Ribosomal_protein_L21P";$3=$4=""} 1'  Input_file

Output will be as follows.

tATAAAta    TBP ~   1   Ca_00015    Ca_00015
0   0   0   0   Ca_00027    Ca_00027
atTTACCgaa  Trihelix    ~   2   Ca_00027    Ca_00027
0   0   0   0   Ca_00027    Ca_00027
0   0   0   0   Ca_00027    Ca_00027
tACCAT Ribosomal_protein_L21P   ~ 2
agtACCAT Ribosomal_protein_L21P   ~ 2
GCAACggagc LEA_type  1 ~ 1
ATGGTa Ribosomal_protein_L21P   ~ 1
ATGGTctt Ribosomal_protein_L21P   ~ 2
ATGGTaca Ribosomal_protein_L21P   ~ 1
GCAACctccc LEA_type  1 ~ 1

Adding a non-one-liner form of the solution too.

awk '
($2=="LEA" && $3=="type"){
  $2="LEA_type";
  $3=""
}
($2=="Ribosomal" && $3=="protein" && $4=="L21P"){
  $2="Ribosomal_protein_L21P";
  $3=$4=""
}
1
'  Input_file
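A note on why the cells can still look shifted after this: assigning "" to a field does not remove it, so awk rebuilds the line with two separators side by side. The sketch below is only an illustration on a hypothetical, simplified row (the variable names blanked and squeezed are made up for this example); it shows the effect and one common way to squeeze the empty field out again.

```shell
# Hypothetical one-row sample, simplified from the question.
line='GCAACggagc LEA type ~ 1'

# Emptying $3 keeps the field in place, so two separators end up side by side.
blanked=$(printf '%s\n' "$line" | awk '{ $2 = $2 "_" $3; $3 = ""; print }')

# Re-splitting ($0 = $0) and rebuilding ($1 = $1) drops the empty field.
squeezed=$(printf '%s\n' "$line" | awk '{ $2 = $2 "_" $3; $3 = ""; $0 = $0; $1 = $1; print }')

# Brackets make the doubled space in the first result visible.
printf '[%s]\n[%s]\n' "$blanked" "$squeezed"
```

Piping through column -t has a similar visual effect, since it re-aligns on runs of whitespace.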

EDIT: Since the OP has changed the sample, the code is changed slightly as follows. Also use awk -F"\t" in case your Input_file is TAB delimited.

awk '
($2=="LEA" && $3=="type"){
  $2="LEA_type";
  $3=$4="";
}
($2=="Ribosomal" && $3=="protein" && $4=="L21P"){
  $2="Ribosomal_protein_L21P";
  $3=$4="";
}
1
' Input_file | column  -t

Output will be as follows.

1st_col     2nd_col                 3rd_col  4th_col  5th_col  6th_col
tATAAAta    TBP                     ~        1
tACCAT      Ribosomal_protein_L21P  ~        2
agtACCAT    Ribosomal_protein_L21P  ~        2
ATGTActt    AP2                     ~        1
GCAACggagc  LEA_type                ~        1
ATGGTa      Ribosomal_protein_L21P  ~        1
ATGGTctt    Ribosomal_protein_L21P  ~        2
ATGGTaca    Ribosomal_protein_L21P  ~        1
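If the file really is TAB delimited and the spaces only occur inside the second field, a simpler route may be to split on TAB and replace the spaces in that one field. This is a minimal sketch under that assumption; sample.tsv and merged.tsv are hypothetical file names, and the rows are a cut-down version of the question's data with the name kept whole in column 2.

```shell
# Hypothetical sample: real TABs between columns, spaces only inside column 2.
printf 'tACCAT\tRibosomal protein L21P\t~\t2\n' > sample.tsv
printf 'GCAACggagc\tLEA type\t~\t1\n' >> sample.tsv
printf 'ATGTActt\tAP2\t~\t1\n' >> sample.tsv

# Split and rejoin on TAB so that only the spaces inside field 2 are touched.
awk -F'\t' -v OFS='\t' '{ gsub(/ /, "_", $2) } 1' sample.tsv > merged.tsv
cat merged.tsv
```

Because both the input and output separators are TAB, the remaining columns stay where they are. If the name is actually spread over several whitespace-separated fields, as in the sample shown in the question, this will not help and the field-by-field approach above is needed.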

Answer 2

Here is a more flexible way,

awk '$2~/[^0-9|^~]+/{                  # act on lines where $2 has a character other than a digit, |, ^ or ~
  for(i=3;i<=NF;i++){             # keep scanning, starting from $3
    if($i~/[^0-9|^~]+/){          # $i is neither purely numeric nor a tilde
      $2=sprintf("%s_%s",$2,$i);  # append it to $2 as $2_$i
      $i=""                       # blank out $i
    } else                        # stop at the first numeric or tilde field
      break
  }
}1' file

Here is the one-liner,

awk '$2~/[^0-9|^~]+/{for(i=3;i<=NF;i++){ if($i~/[^0-9|^~]+/){ $2=sprintf("%s_%s",$2,$i); $i="" } else break } }1' file

Edit

Updated the answer for the updated OP.

awk '$3~/[^~]/ && NR>1{for(i=3;i<=NF;i++){ if($i~/[^0-9|^~]+/){ $2=sprintf("%s_%s",$2,$i); $i="" } else{$3="~"; $4=$(i+1); $i=""; $(i+1)=""; break} } }1' file5 | column -t
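The same stop-at-numeric-or-tilde rule can also be written so that each line is rebuilt explicitly instead of blanking fields and relying on column -t. The following is a sketch rather than a drop-in replacement: sample.txt and rebuilt.tsv are hypothetical file names, the header is assumed to be on line 1, and every data row is assumed to contain a "~" field after the name.

```shell
# Hypothetical sample in the question's layout (whitespace-separated).
cat > sample.txt <<'EOF'
1st_col 2nd_col 3rd_col 4th_col
tATAAAta TBP ~ 1
tACCAT Ribosomal protein L21P ~ 2
GCAACggagc LEA type 1 ~ 1
EOF

awk -v OFS='\t' '
NR == 1 { $1 = $1; print; next }          # header: only normalise separators
{
  name = $2
  i = 3
  # Append name parts until a purely numeric field or a tilde is reached.
  while (i <= NF && $i !~ /^(~|[0-9]+)$/) { name = name "_" $i; i++ }
  # Drop anything left between the name and the tilde (e.g. the stray 1).
  while (i <= NF && $i != "~") i++
  out = $1 OFS name
  while (i <= NF) { out = out OFS $i; i++ }
  print out
}' sample.txt > rebuilt.tsv
cat rebuilt.tsv
```

Writing TAB-separated output directly keeps the columns unambiguous for later tools; pipe it through column -t only when aligned display is wanted.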

How to merge the contents of two columns in specific rows without shifting the rest of the cells
