Question

I have an array, let's call it ensembldb, that contains the following lines:

rs2799070   ENST00000379389 ENSG00000187608 ISG15   inframe_insertion   NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NM_005101.3    NP_005092
rs2799070   ENST00000458555 ENSG00000224969 AL645608.2  missense_variant    NA  NA  antisense   NA  NULL    NULL
rs2799070   ENST00000624652 ENSG00000187608 ISG15   inframe_deletion    NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL
rs2799070   ENST00000624697 ENSG00000187608 ISG15   frameshift_variant  NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL

and another, ordered array, let's call it ordered_array:

frameshift_variant
missense_variant
inframe_insertion
inframe_deletion

I want to sort the array ensembldb so that its rows follow the order given in ordered_array. The expected output is:

rs2799070   ENST00000624697 ENSG00000187608 ISG15   frameshift_variant  NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL
rs2799070   ENST00000458555 ENSG00000224969 AL645608.2  missense_variant    NA  NA  antisense   NA  NULL    NULL
rs2799070   ENST00000379389 ENSG00000187608 ISG15   inframe_insertion   NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NM_005101.3    NP_005092
rs2799070   ENST00000624652 ENSG00000187608 ISG15   inframe_deletion    NA  NA  protein_coding  ISG15   ubiquitin-like  modifier    [Source:HGNC    Symbol;Acc:HGNC:4053]NULL   NULL

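I checked this question, but since it deals with a multidimensional array, it does not answer mine. How can I sort my array ensembldb according to the ordered array ordered_array?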

Thanks.

Edit 1: adding code as requested by @anubhava

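# Indexed array (-a): -A would declare an associative array, whose iteration order is not guaranteed.
declare -a ordered_array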
ordered_array[0]="frameshift_variant"
ordered_array[1]="missense_variant"
ordered_array[2]="inframe_insertion"
ordered_array[3]="inframe_deletion"

while read -r LINE; do
    chrom=$(echo -e "$LINE" | cut -f1 -d$'\t' | sed 's/^chr//g')
    pos=$(echo -e "$LINE" | cut -f2 -d$'\t')
    ref=$(echo -e "$LINE" | cut -f3 -d$'\t')
    alt=$(echo -e "$LINE" | cut -f4 -d$'\t')
    LINE=$(echo -e "$LINE" | sed 's/^chr//g')
    ensembldb=$(echo "PREPARE stmt1 FROM 'SELECT Annotated_ID, Transcript, Gene_ID, Gene_name, Consequence, Swissprot_ID, AA_change, Biotype, Gene_description, RefSeq_mRNA, RefSeq_peptide FROM SNP_annot.37_annot_ensembl_89_full_descr where chrom = \"$chrom\" and Start = \"$pos\" and Local_alleles = \"$ref/$alt\"'; execute stmt1;" | mariadb -A -N)
    readarray -t array <<< "$ensembldb"
    pos19=$(echo "PREPARE stmt2 FROM 'select hg19_pos from SNP_annot.mut_convert_pos where chrom = \"$chrom\" and hg38_pos = \"$pos\"'; execute stmt2;" | mariadb -A -N)
    hits=$(echo -e "$ensembldb" | wc -l)
    [ ! -z "$pos19" ] && awk -v line="$LINE" -v pos="$pos19" -v ensembl="$ensembldb" -v hit="$hits" 'BEGIN {print line"\t"ensembl"\t"hit"\t"pos}'
done

1. The variable LINE holds lines like the following:

CHROM   POS REF ALT QUAL    DP  Genotype
chr1    16495   G   C   1722.77 252 G/C
chr1    16719   T   A   145.77  189 T/A
chr1    16841   G   T   701.77  521 G/T
chr1    17626   G   A   154.77  124 G/A

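2. The variable ensembldb holds the result of a MySQL query that returns multiple rows, which are then converted into an array. It contains the rows I want to sort according to ordered_array, and from the sorted result I want to select the first row that matches ordered_array.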

Answer 1

This awk may work for you:

awk 'FNR==NR{a[$5]=$0;next}{print a[$1]}' file_a file_b
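
To make the one-liner concrete, here is a minimal sketch that runs it on sample data. The temporary files and the rows trimmed to five columns are assumptions made for illustration; file_a stands for the rows to sort and file_b for the desired order. While awk reads the first file (FNR==NR), it stores each row keyed by column 5; while it reads the second file, it prints the stored row for each key.

```bash
#!/usr/bin/env bash
# Hypothetical sample data: the rows from the question, trimmed to five columns.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'rs2799070 ENST00000379389 ENSG00000187608 ISG15 inframe_insertion' \
  'rs2799070 ENST00000458555 ENSG00000224969 AL645608.2 missense_variant' \
  'rs2799070 ENST00000624652 ENSG00000187608 ISG15 inframe_deletion' \
  'rs2799070 ENST00000624697 ENSG00000187608 ISG15 frameshift_variant' \
  > "$tmpdir/file_a"
printf '%s\n' frameshift_variant missense_variant inframe_insertion inframe_deletion \
  > "$tmpdir/file_b"

# Pass 1 (file_a): index each row by its consequence (column 5).
# Pass 2 (file_b): print the indexed row for each consequence, in file_b order.
sorted=$(awk 'FNR==NR{a[$5]=$0;next}{print a[$1]}' "$tmpdir/file_a" "$tmpdir/file_b")
printf '%s\n' "$sorted"

rm -rf "$tmpdir"
```

Because the rows are keyed by column 5, only the last row is kept when two rows share the same consequence.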

If a and b really are arrays:

readarray -t a < <(awk 'FNR==NR{a[$5]=$0;next}{print a[$1]}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}"))
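
The question also asks for the first row that matches ordered_array. The sketch below is one possible variant, not part of the original answer: it keeps every row that shares a consequence, skips consequences with no matching row (the one-liner above prints an empty line for those), and then takes the first element of the sorted array. The array contents, the names ensembl_rows, sorted and best, and the duplicated inframe_deletion row are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the arrays in the question (rows trimmed to five columns).
ensembl_rows=(
  'rs2799070 ENST00000379389 ENSG00000187608 ISG15 inframe_insertion'
  'rs2799070 ENST00000458555 ENSG00000224969 AL645608.2 missense_variant'
  'rs2799070 ENST00000624652 ENSG00000187608 ISG15 inframe_deletion'
  'rs2799070 ENST00000999999 ENSG00000187608 ISG15 inframe_deletion'   # made-up duplicate
)
ordered_array=(frameshift_variant missense_variant inframe_insertion inframe_deletion)

readarray -t sorted < <(
  awk '
    # Pass 1: group rows by consequence (column 5), keeping all duplicates.
    FNR == NR {
      if ($5 in rows) rows[$5] = rows[$5] ORS $0
      else            rows[$5] = $0
      next
    }
    # Pass 2: print groups in the requested order, skipping absent consequences.
    $1 in rows { print rows[$1] }
  ' <(printf '%s\n' "${ensembl_rows[@]}") <(printf '%s\n' "${ordered_array[@]}")
)

printf '%s\n' "${sorted[@]}"

# The highest-priority row is simply the first element of the sorted array.
best=${sorted[0]}
```

Here no row carries frameshift_variant, so best ends up holding the missense_variant row, the next entry in the priority list.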

Sort a bash array according to an ordered array pattern
