Question

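I am extracting patterns of interest from a file. Each line contains the pattern several times, and I want to collect all occurrences on each line in comma-separated format. For example, each line contains a string like this: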

Line1: InterPro:IPR000504 InterPro:IPR003954 InterPro:IPR012677 Pfam:PF00076 PROSITE:PS50102 SMART:SM00360 SMART:SM00361 EMBL:CP002684 Proteomes:UP000006548 GO:GO:0009507 GO:GO:0003723 GO:GO:0000166 Gene3D:3.30.70.330 SUPFAM:SSF54928 eggNOG:KOG0118 eggNOG:COG0724 InterPro:IPR003954

Line2: InterPro:IPR000306 InterPro:IPR002423 InterPro:IPR002498 Pfam:PF00118 Pfam:PF01363 Pfam:PF01504 PROSITE:PS51455 SMART:SM00064 SMART:SM00330 InterPro:IPR013083 Proteomes:UP000006548 GO:GO:0005739 GO:GO:0005524 EMBL:CP002686 GO:GO:0009555 GO:GO:0046872 GO:GO:0005768 GO:GO:0010008 Gene3D:3.30.40.10 InterPro:IPR017455

I want to extract all InterPro IDs for each line, like this:

IPR000504,IPR003954,IPR012677,IPR003954

IPR000306,IPR002423,IPR002498,IPR013083,IPR017455

I used this script:

while read line; do
    NUM=$(echo $line | grep -oP 'InterPro:\K[^ ]+' | wc -l)
    if [ $NUM -eq 0 ];then
       echo "NA" >> InterPro.txt;
    fi; 
    if [ ! $NUM -eq 0 ];then
       echo $line | grep -oP 'InterPro:\K[^ ]+' | tr '\n' ',' >> InterPro.txt;
    fi;
done <./File.txt

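The problem is that when I run this script, all the pattern values from File.txt are printed on a single line. I want the values for each input line to be printed on their own line.

Thanks in advance.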

Answer 1

Using awk:

awk '{for (i=1; i<=NF; ++i) {if ($i~/^InterPro:/) {gsub(/InterPro:/, "", $i); x=x","$i}} gsub (/^,/, "", x); print x; x=""}' file

Output:

IPR000504,IPR003954,IPR012677,IPR003954
IPR000306,IPR002423,IPR002498,IPR013083,IPR017455

With indentation and more meaningful variable names:

awk '
{
  for (column=1; column<=NF; ++column) 
  {
    if ($column~/^InterPro:/) 
    {
      gsub(/InterPro:/, "", $column)
      line=line","$column
    }
  } 
  gsub (/^,/, "",line)
  print line
  line=""
}' file
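
The question's script writes "NA" for lines without any InterPro ID, whereas the awk command above prints an empty line in that case. A minimal sketch of a variant that keeps the "NA" behaviour, assuming that is still wanted; `extract_interpro` is a hypothetical helper name, not part of the original answer:

```shell
# Hypothetical helper: prints the comma-separated InterPro IDs of each input
# line, or "NA" when a line has none. Reads the given file(s), or stdin.
extract_interpro() {
  awk '
  {
    ids = ""
    for (col = 1; col <= NF; ++col) {
      if ($col ~ /^InterPro:/) {
        id = $col
        sub(/^InterPro:/, "", id)            # strip the prefix from a copy of the field
        ids = (ids == "" ? id : ids "," id)  # no leading comma to clean up later
      }
    }
    print (ids == "" ? "NA" : ids)
  }' "$@"
}
```

It could be called as `extract_interpro file > InterPro.txt`. Building the list with a conditional separator avoids the extra gsub that removes the leading comma.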

Using bash builtins:

while IFS= read -r line; do 
  for column in $line; do
    [[ $column =~ ^InterPro:(.*) ]] && new+=",${BASH_REMATCH[1]}"
  done
  echo "${new#,*}"
  unset new
done < file

Answer 2

In the end, I changed the script and got the result I wanted:

while read line; do
    NUM=$(echo $line | grep -oP 'InterPro:\K[^ ]+' | wc -l)
    if [ $NUM -eq 0 ];then
       echo "NA" >> InterPro.txt;
    fi; 
    if [ ! $NUM -eq 0 ];then
       echo $line | grep -oP 'InterPro:\K[^ ]+' | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}' | sed 's/ /,/g'  >> InterPro.txt;
    fi;
done <./File.txt
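
For context, the original loop produced a single line because `tr '\n' ','` also turns the last newline of each grep result into a comma, so no line break is ever written after the IDs. A shorter alternative, sketched here under the assumption that `paste` is available, joins the matches with `paste -sd,`, which ends its output with one newline. The function name `interpro_ids` is hypothetical, and the sketch uses plain `grep -o` plus `sed` instead of `grep -oP` with `\K` so that it does not depend on PCRE support:

```shell
# Hypothetical helper: reads lines from stdin and prints, for each one, its
# InterPro IDs joined by commas, or "NA" if the line contains none.
interpro_ids() {
  while IFS= read -r line; do
    # grep -o prints one match per line, sed strips the "InterPro:" prefix,
    # and paste -sd, joins the lines with commas.
    ids=$(printf '%s\n' "$line" | grep -o 'InterPro:[^ ]*' | sed 's/^InterPro://' | paste -sd, -)
    # Fall back to "NA" when nothing matched.
    printf '%s\n' "${ids:-NA}"
  done
}
```

It could be run as `interpro_ids < File.txt > InterPro.txt`. Redirecting once, outside the loop, also avoids reopening InterPro.txt for every input line.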

How to extract all repeated patterns in a line into comma-separated format
