How can I use awk with my variables and a unique identifier to format an input file into rows instead of columns?

Time: 2017-12-12 08:22:03

Tags: bash awk format

I am trying to arrange the input of a file into rows instead of columns. For example, if my input is:

ID0001 G0001
ID0001 G0004
ID0001 G2332
ID0001 G2332
ID0002 G0002
ID0002 G2332

  • The output should not contain duplicates within the same ID, but the same value may appear under different IDs.

Output:

ID0001 G0001,G0004,G2332
ID0002 G0002,G2332

This is what I have so far:

#!/bin/bash

uniq $1 > edited.original_ID.txt

counter=1
echo "$(awk 'NR==1{print $1}' edited.original_ID.txt) " >> out.csv

cat edited.original_ID.txt | while read line
do
  UNIQUE_ID=$(awk '{print $1}' "NR==$counter" edited.original_ID.txt)
  NEXT_ID=$(awk '{print $1}' "NR==$((counter+1))" edited.original_ID.txt)

  if [ "${UNIQUE_ID}" == "${NEXT_ID}" ]
  then
     awk "NR==$counter" | awk '{print $2}' edited.original_ID.txt | xargs >> out.csv
  elif [ "${UNIQUE_ID}" != "${NEXT_ID}" ]
  then
     echo "$(awk "NR==$counter" | awk '{print $1}' edited.original_ID.txt)" >> out.csv
     echo -n "$(awk "NR==$counter" | awk '{print $1}' edited.original_ID.txt) " >> out.csv
  fi

  ((counter++))
done

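As it stands, my code never finishes unless I kill it. I am fairly sure the mistake is in my awk commands, but I am not sure how to write them so that awk takes my variable and the first column. If anyone can help me fix the error, I would really appreciate it! Note: you will see that I wrote the awk calls in several different ways; I was trying to find out which of them would work.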

2 Answers:

Answer 0 (score: 1)

Using awk

awk -v OFS=, '!tmp[$1,$2]++{arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2}
              END{for(i in arr)print i" "arr[i]}' infile

Explanation

awk -v OFS=, '# call awk and set the output field separator to a comma

              # tmp is an array keyed by the pair (field 1, field 2).
              # !tmp[$1,$2]++ is true only the first time a pair is seen:
              # ++ is a post-increment, so the stored count is 0 (false) on the
              # first lookup and non-zero on every repeat.
              # Duplicate (ID, value) pairs are therefore skipped.

              !tmp[$1,$2]++{

                  # arr is an array keyed by field 1 (the ID).
                  # If the ID already has an entry, append OFS and field 2 to it;
                  # otherwise start the entry with field 2 alone.

                  arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2
              }

              # the END block runs once, after all input has been read
              END{

                  # loop over arr and print each key followed by its value

                  for(i in arr)
                     print i" "arr[i]
              }
              ' infile

Test results:

$ cat infile
ID0001 G0001
ID0001 G0004
ID0001 G2332
ID0001 G2332
ID0002 G0002
ID0002 G2332

$ awk -v OFS=, '!tmp[$1,$2]++{arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2}END{for(i in arr)print i" "arr[i]}' infile
ID0001 G0001,G0004,G2332
ID0002 G0002,G2332
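
One caveat about the one-liner above: awk does not specify the order in which `for (i in arr)` visits keys, so the IDs are not guaranteed to come out in input order on every awk implementation. The following is a minimal sketch of one way to keep first-seen order, assuming whitespace-separated `ID value` lines. The names `group_ids`, `seen`, `vals`, and `order` are illustrative and not part of the answer.

```shell
#!/bin/bash
# Sketch: group values per ID, drop repeated (ID, value) pairs,
# and print the IDs in the order they first appear in the input.
# group_ids is a hypothetical helper name.

group_ids() {
  awk '
    NF >= 2 && !seen[$1,$2]++ {      # ignore short lines and repeated pairs
      if (!($1 in vals)) {
        order[++n] = $1              # remember each new ID in arrival order
        vals[$1] = $2
      } else {
        vals[$1] = vals[$1] "," $2   # append further values with a comma
      }
    }
    END {
      for (k = 1; k <= n; k++)       # numeric loop, so the order is fixed
        print order[k], vals[order[k]]
    }
  ' "$1"
}

# Demo on the sample data from the question.
sample=$(mktemp)
printf '%s\n' 'ID0001 G0001' 'ID0001 G0004' 'ID0001 G2332' \
              'ID0001 G2332' 'ID0002 G0002' 'ID0002 G2332' > "$sample"
group_ids "$sample"
rm -f "$sample"
```

If sorted output is wanted instead of input order, piping the original one-liner through `sort` is a simpler alternative.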

Answer 1 (score: 0)

A small script, 'idsort.sh', as a Bash solution:

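    #!/bin/bash -

    declare -A ID    # associative array: ID -> newline-separated G values

    while read -r id gval ; do
      ID[$id]+=$gval$'\n'
    done < "$1"

    for id in "${!ID[@]}"; do
      # sort --unique drops repeated values; paste joins them with commas
      echo "$id $( printf '%s' "${ID[$id]}" | sort --unique | paste -sd, - )"
    done | sort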

Call it like this:

   idsort.sh infile > outfile

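The first loop collects all G values for a given ID into one string, with \n as the separator. The second loop pipes those values through sort and prints the unique G values after their ID. The final sort after the second loop puts the lines in ascending ID order.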
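
On the question's narrower point of getting a shell variable into awk: awk treats only its first non-option argument as the program, so in `awk '{print $1}' "NR==$counter" file` the second argument is not run as a condition. The usual mechanism is the `-v` option, which assigns an awk variable before the program starts. The sketch below assumes the goal is to read one field from one line; `nth_field` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: pass shell values into awk with -v instead of splicing them
# into the program text. nth_field is a hypothetical helper name.

nth_field() {    # usage: nth_field FILE LINE_NUMBER FIELD_NUMBER
  awk -v n="$2" -v f="$3" 'NR == n { print $f; exit }' "$1"
}

# Demo on a few lines shaped like the question's input.
sample=$(mktemp)
printf '%s\n' 'ID0001 G0001' 'ID0001 G0004' 'ID0002 G0002' > "$sample"

counter=2
nth_field "$sample" "$counter" 1    # first column of line 2
nth_field "$sample" "$counter" 2    # second column of line 2
rm -f "$sample"
```

Calling awk once per line like this is still slow on large files, which is why both answers process the whole file in a single pass.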