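I am trying to arrange the input of a file into rows instead of columns, so that all values belonging to the same ID end up on one line. For example, if my input is: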
ID0001 G0001
ID0001 G0004
ID0001 G2332
ID0001 G2332
ID0002 G0002
ID0002 G2332
the output should be:
ID0001 G0001,G0004,G2332
ID0002 G0002,G2332
This is what I have so far:
#!/bin/bash
uniq $1 > edited.original_ID.txt
counter=1
echo "$(awk 'NR==1{print $1}' edited.original_ID.txt) " >> out.csv
cat edited.original_ID.txt | while read line
do
UNIQUE_ID=$(awk '{print $1}' "NR==$counter" edited.original_ID.txt)
NEXT_ID=$(awk '{print $1}' "NR==$((counter+1))" edited.original_ID.txt)
if [ "${UNIQUE_ID}" == "${NEXT_ID}" ]
then
awk "NR==$counter" | awk '{print $2}' edited.original_ID.txt | xargs >> out.csv
elif [ "${UNIQUE_ID}" != "${NEXT_ID}" ]
then
echo "$(awk "NR==$counter" | awk '{print $1}' edited.original_ID.txt)" >> out.csv
echo -n "$(awk "NR==$counter" | awk '{print $1}' edited.original_ID.txt) " >> out.csv
fi
((counter++))
done
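As it stands, my code never finishes unless I kill it. I am fairly sure my mistake is in the awk commands, but I am not sure how to write them so that they take my variable and the first column. If anyone could help me fix the error, I would really appreciate it! Note: you will see that I wrote the awk calls in several different ways; I was trying to see which ones would work.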
Answer 0 (score: 1)
Using awk:
awk -v OFS=, '!tmp[$1,$2]++{arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2}
END{for(i in arr)print i" "arr[i]}' infile
Explanation:
awk -v OFS=, '# run awk with the output field separator set to a comma
# tmp is an array keyed by the pair (field 1, field 2)
# !tmp[$1,$2]++ is true only the first time a given pair is seen:
# ++ is a post-increment, so the test sees 0 on the first occurrence
# and a non-zero count on every repetition,
# which means duplicate pairs are skipped
!tmp[$1,$2]++{
# arr is an array keyed by field 1
# $1 in arr : if the key already exists,
# append OFS and field 2 to the stored value; otherwise store just field 2
arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2
}
# the END block runs once, after all input has been read
END{
# iterate over arr (awk does not guarantee the order of for-in)
# and print each key followed by its value
for(i in arr)
print i" "arr[i]
}
' infile
Test results:
$ cat infile
ID0001 G0001
ID0001 G0004
ID0001 G2332
ID0001 G2332
ID0002 G0002
ID0002 G2332
$ awk -v OFS=, '!tmp[$1,$2]++{arr[$1] =($1 in arr ? arr[$1] OFS : "" ) $2}END{for(i in arr)print i" "arr[i]}' infile
ID0001 G0001,G0004,G2332
ID0002 G0002,G2332
Answer 1 (score: 0)
A small script, 'idsort.sh', as a Bash solution:
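#!/bin/bash -
# Requires Bash 4+ for associative arrays.
declare -A ID
# Collect every G value per ID as a newline-terminated string.
while read -r id gval ; do
ID[$id]+="$gval"$'\n'
done < "$1"
# Print each ID followed by its unique G values (separated by spaces).
for id in "${!ID[@]}"; do
echo "$id" $( printf '%s' "${ID[$id]}" | sort --unique )
done | sort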
Call it like this:
idsort.sh infile > outfile
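The first loop collects all G values for a given ID into one string, with newlines as separators. The second loop pipes these values to the sort command and prints the unique G values after their ID. The final sort after the second loop orders the lines by ascending ID. Note that the G values are printed separated by spaces, not commas.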
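To get the comma-separated format requested in the question, the sorted values can be joined with `paste` instead of relying on word splitting. This is only a sketch under a few assumptions: Bash 4 or later (for associative arrays), a system that provides `/dev/stdin`, and the hypothetical function name `idsort_csv`, which does not appear in the original script.

```shell
#!/bin/bash
# idsort_csv: hypothetical name for a comma-joining variant of the script above.
# Reads "ID value" pairs from the file given as $1, or from stdin if no file is given.
idsort_csv() {
    local -A groups=()      # associative array: ID -> newline-terminated values
    local id gval
    # "|| [[ -n $id ]]" also handles a last line that lacks a trailing newline
    while read -r id gval || [[ -n $id ]]; do
        [[ -n $id ]] || continue            # skip blank lines
        groups[$id]+="$gval"$'\n'
    done < "${1:-/dev/stdin}"

    for id in "${!groups[@]}"; do
        # sort -u drops duplicates; paste -sd, joins the remaining lines with commas
        printf '%s %s\n' "$id" "$(printf '%s' "${groups[$id]}" | sort -u | paste -sd, -)"
    done | sort
}

# Demo with the sample data from the question.
idsort_csv <<'EOF'
ID0001 G0001
ID0001 G0004
ID0001 G2332
ID0001 G2332
ID0002 G0002
ID0002 G2332
EOF
# prints:
# ID0001 G0001,G0004,G2332
# ID0002 G0002,G2332
```

Unlike the awk one-liner, this variant sorts the G values within each ID, because duplicates are removed with `sort -u`.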