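I have a file with several fields. I am trying to remove duplicates within a single field (i.e. the same attribute appearing twice with different dates). For example: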
Andro manual gene 1 100 . + . ID=truc;Name=truc;modified=13-09-1993;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;modified=13-09-1993;added=13-09-1993;modified=13-09-1997
Here modified=13-09-1993 and modified=13-09-1997 are duplicates ("doublons"). So I would like to get this:
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
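I want to keep the most recent occurrence of a given attribute and delete the older one. An attribute appears at most twice on the same line.
I have tried this code (it currently runs):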
INPUT=$1
ID=$2
ALL_FEATURES=()
CONTIG_FEATURES=$(grep $ID $INPUT)
while read LINE; do
    FEATURES=$(echo -e "$LINE" | cut -f 9)
    #For each line, store all attributes from every line in an array
    IFS=';' read -r -a ARRAY <<< "$FEATURES"
    #Once the array is created, loop in array to look for doublons
    for INDEX in "${!ARRAY[@]}"
    do
        ELEMENT=${ARRAY[INDEX]}
        #If we are not at the end of the array, compare actual element and next element
        ACTUAL=$ELEMENT
        for INDEX2 in "${!ARRAY[@]}"
        do
            NEXT="${ARRAY[INDEX2]}"
            ATTRIBUTE1=$(echo -e "$ACTUAL" | cut -d'=' -f1)
            ATTRIBUTE2=$(echo -e "$NEXT" | cut -d'=' -f1)
            echo "Comparing element number $INDEX ($ATTRIBUTE1) with element number $INDEX2 ($ATTRIBUTE2) ..."
            if [[ $ATTRIBUTE1 = $ATTRIBUTE2 ]] && [[ $INDEX -ne $INDEX2 ]]
            then
                echo "Deleting features..."
                #Delete actual element, because next element will be more recent
                NEW=()
                for VAL in "${ARRAY[@]}"
                do
                    [[ $VAL != "${ARRAY[INDEX]}" ]] && NEW+=($VAL)
                done
                ARRAY=("${NEW[@]}")
                unset NEW
            fi
        done
    done
    #Rewriting array into string separated by ;
    FEATURES2=$( IFS=$';'; echo "${ARRAY[*]}" )
    sed -i "s/$FEATURES/$FEATURES2/g" $INPUT
done < <(echo -e "$CONTIG_FEATURES")
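I would like some advice, because I suspect my array approach is not a smart one, but I want a bash solution in any case. Any bash tips or shortcuts that would improve my understanding of bash are appreciated.
Sorry if I forgot any details, and thank you for your help.
Roxane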
Answer 0 (score: 3)
In awk:
$ awk '
{
    n=split($NF,a,";")                  # split the last field by ;
    for(i=n;i>=1;i--) {                 # iterate backwards to keep the last "doublon"
        split(a[i],b,"=")               # split key=value at =
        if(!(b[1] in c)) {              # if key not yet in hash c
            d=a[i] (d==""?"":";") d     # prepend key=value to d, separated by ;
            c[b[1]]                     # hash key into c
        }
    }
    $NF=d                               # set d to last field
    delete c                            # clear c for next record
    d=""                                # ditto for d
}
1                                       # output
' file
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
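The question's script reads the attributes with `cut -f 9`, which suggests the columns are tab-separated. Assigning to `$NF` makes awk rebuild the record with `OFS`, which defaults to a single space, so tabs would not survive. A minimal sketch of the same backward scan with tab as both input and output separator follows. It assumes the attributes are the last tab-separated column; the function name `dedup_last_column` and the array names are made up for illustration.

```shell
# Sketch only: dedup_last_column is a hypothetical name, not from the original answer.
# Assumes tab-separated input whose last column holds the ;-separated attributes.
dedup_last_column() {
    awk 'BEGIN { FS = OFS = "\t" }       # keep tabs when the record is rebuilt
    {
        n = split($NF, a, ";")           # attributes of the last column
        d = ""
        split("", seen)                  # portable way to empty the array
        for (i = n; i >= 1; i--) {       # backwards, so the last occurrence wins
            split(a[i], kv, "=")         # kv[1] is the attribute name
            if (!(kv[1] in seen)) {
                seen[kv[1]]              # remember this name
                d = a[i] (d == "" ? "" : ";") d
            }
        }
        $NF = d
        print
    }' "$@"
}
```

It could be called as `dedup_last_column file > file.dedup`, or fed from a pipe.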
Answer 1 (score: 1)
The following awk may also help you.
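awk -F';' '{
    for(i=NF;i>0;i--){              # scan the ;-separated fields from last to first
        split($i, array,"=");       # array[1] is the attribute name
        if(++a[array[1]]>1){        # name already seen further right: older duplicate
            $i="\b"                 # replace it with a backspace; this hides the extra ; on a terminal, but both bytes stay in the output
        }
    };
    delete a                        # reset the counts for the next line
}
1
' OFS=";"  Input_file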
The output is as follows.
Andro manual gene 1 100 . + . ID=truc;Name=truc;added=13-09-1993;modified=13-09-1997
Andro manual mRNA 1 100 . + . ID=truc-mRNA;Name=truc-mRNA;added=13-09-1993;modified=13-09-1997
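Since the question asks for a bash solution, the same idea (walk the attributes backwards and keep the first key seen) can also be sketched in pure bash with an associative array. This is only a sketch under a few assumptions: bash 4 or later, tab-separated columns, and the attributes in the last column. The function names `dedup_attrs` and `dedup_file` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: dedup_attrs and dedup_file are hypothetical helper names.
# Requires bash 4+ for associative arrays (local -A).

# Keep only the last occurrence of each key in a ;-separated key=value list.
dedup_attrs() {
    local -a parts kept=()
    local -A seen=()
    local i key
    IFS=';' read -r -a parts <<< "$1"
    # Walk backwards so the last (most recent) occurrence of a key wins.
    for (( i = ${#parts[@]} - 1; i >= 0; i-- )); do
        key=${parts[i]%%=*}             # text before the first =
        [[ -n $key ]] || continue       # skip empty elements such as ";;"
        if [[ -z ${seen[$key]+x} ]]; then
            seen[$key]=1
            kept=("${parts[i]}" "${kept[@]}")   # prepend to restore the order
        fi
    done
    local IFS=';'
    printf '%s\n' "${kept[*]}"          # join with ;
}

# Apply dedup_attrs to the last tab-separated column of every line of a file.
dedup_file() {
    local line
    while IFS= read -r line; do
        if [[ $line == *$'\t'* ]]; then
            # ${line%TAB*} is everything before the last tab, ${line##*TAB} everything after it
            printf '%s\t%s\n' "${line%$'\t'*}" "$(dedup_attrs "${line##*$'\t'}")"
        else
            printf '%s\n' "$line"       # no tab: pass the line through unchanged
        fi
    done < "$1"
}
```

Usage would look like `dedup_file input.gff > output.gff`. Compared with the nested loops in the question, the associative array avoids comparing every pair of attributes, and writing to a new file avoids running `sed -i` once per line.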