Question

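I want to flexibly print the output of two small awk pipelines in bash that use a variable (they worked originally). I first thought I could store the whole command in a variable, but for one thing that doesn't work, and apparently (see "store awk command in a variable of bash script") it is not a good idea anyway. So I wrote two functions instead, but now I get an "unexpected token" error near "done", even though the code is formatted like the code in the linked question.

Where is my mistake?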

for coverage_file in */*.cov
do
    #gene_count=$(awk '{print $5}' $coverage_file |sort | uniq -c | wc -l) #this is apparently not a good idea
    #contig_count=$(awk '{print $1}' $coverage_file |sort | uniq -c | wc -l) #this is apparently not a good idea
    cmd_gene() { awk '{print $5}' $coverage_file |sort | uniq -c | wc -l }
    cmd_contig() { awk '{print $1}' $coverage_file |sort | uniq -c | wc -l }
    cmd_gene $coverage_file
    cmd_contig $coverage_file
    #print "we found", $gene_count, "genes on ",$contig_count" contigs
done

The cov file looks like this:

k141_85332.3 4119 19 A5 phnM_031
k141_85332.3 4119 19 A5 phnM_031
k141_85332.3 4119 28 A1 phnM_031
k141_85332.3 4119 28 A1 phnM_031
k141_85332.3 4119 8 A2 phnM_031
k141_85332.3 4119 8 A2 phnM_031
k141_88684 267 5 B10 phnM_032
k141_88684 268 5 B10 phnM_032
k141_88684 269 5 B10 phnM_032
k141_88684 270 5 B10 phnM_032
k141_88684 271 5 B10 phnM_032
k141_88684 272 5 B10 phnM_032

Edit: this incorporates the accepted answer plus one possible way to print the result explicitly:

#!/bin/bash

#define variables
gene="phnM"
threshold="5"

#define functions
cmd_gene() { awk '{print $5}' $1 |sort | uniq -c | wc -l ; } #semicolon is important here!
cmd_contig() { awk '{print $1}' $1 |sort | uniq -c | wc -l ; } #semicolon is important here!

#loop over files and print results (would be prettier with printf)
for coverage_file in */*.cov
do
    echo $gene" was found" $(cmd_gene "$coverage_file") "times on" $(cmd_contig "$coverage_file")" contigs with minimum coverage of" $threshold in $coverage_file
done

Output:

phnM was found 67 times on 65 contigs with minimum coverage of 5 in phnm/test.cov
phnM was found 3 times on 2 contigs with minimum coverage of 5 in test/test.cov
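
The comment in the script above says the output would be prettier with printf. A minimal sketch of that idea follows. The helper names count_unique_field and report are hypothetical (they are not part of the original script), and sort -u is swapped in for sort | uniq -c because only the number of distinct values is needed:

```shell
#!/bin/bash

# Hypothetical helper: count the distinct values in one column of a file.
# $1 = 1-based field number, $2 = file name
count_unique_field() {
    awk -v col="$1" '{ print $col }' "$2" | sort -u | wc -l
}

# Hypothetical helper: print one summary line.
# $1 = file name, $2 = gene label, $3 = coverage threshold
report() {
    printf '%s was found %d times on %d contigs with minimum coverage of %d in %s\n' \
        "$2" "$(count_unique_field 5 "$1")" "$(count_unique_field 1 "$1")" "$3" "$1"
}

for coverage_file in */*.cov
do
    # Skip the literal pattern when no .cov file matches the glob
    [ -e "$coverage_file" ] || continue
    report "$coverage_file" "phnM" 5
done
```

Quoting "$1" and "$coverage_file" keeps file names containing spaces intact, and the single printf format string keeps the wording of the message in one place.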

Answer 1

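The unexpected token error occurs because, when you define a function, the closing } must either be on its own line or be preceded by a ;.

Also, since you use $coverage_file inside the function definition, you don't need to pass it to the function.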

for coverage_file in */*.cov
do
    cmd_gene() { awk '{print $5}' $coverage_file |sort | uniq -c | wc -l; }
    cmd_contig() { awk '{print $1}' $coverage_file |sort | uniq -c | wc -l; }
    cmd_gene 
    cmd_contig 
    #print "we found", $gene_count, "genes on ",$contig_count" contigs
done
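
To see the brace rule in isolation, here is a small sketch that asks bash to parse (but not run) a broken and a fixed one-line function definition. The function name parses and the two snippet variables are made up for this illustration:

```shell
#!/bin/bash

# Broken: without ';' or a newline before it, '}' is read as one more
# argument to wc, so the function body is never closed.
broken='f() { echo hi | wc -l }
f'

# Fixed: ';' ends the wc command, so '}' is recognized as the closing brace.
fixed='f() { echo hi | wc -l; }
f'

# Hypothetical helper: succeed only if bash can parse the given text.
# bash -n reads the commands and checks syntax without executing them.
parses() {
    printf '%s\n' "$1" | bash -n 2>/dev/null
}

for snippet in "$broken" "$fixed"
do
    if parses "$snippet"
    then echo "parses"
    else echo "syntax error"
    fi
done
```

The first snippet should be reported as a syntax error and the second as parsing cleanly. In the loop from the question the parser reaches done while still looking for the closing brace, which is why the error message names that token.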

If you want to define the functions outside the for loop, you can use $1 (not to be confused with awk's $1) and pass $coverage_file as you did before.

Edit: an example of the above:

$ cat a.sh
cmd_gene() { awk '{print $5}' $1 |sort | uniq -c | wc -l; }
cmd_contig() { awk '{print $1}' $1 |sort | uniq -c | wc -l; }

for coverage_file in */*.cov
do
    cmd_gene $coverage_file
    cmd_contig $coverage_file
done

$ ls */*.cov
bf/a.cov

$ cat */*.cov
k141_85332.3 4119 19 A5 phnM_031
k141_85332.3 4119 19 A5 phnM_031
k141_85332.3 4119 28 A1 phnM_031
k141_85332.3 4119 28 A1 phnM_031
k141_85332.3 4119 8 A2 phnM_031
k141_85332.3 4119 8 A2 phnM_031
k141_88684 267 5 B10 phnM_032
k141_88684 268 5 B10 phnM_032
k141_88684 269 5 B10 phnM_032
k141_88684 270 5 B10 phnM_032
k141_88684 271 5 B10 phnM_032
k141_88684 272 5 B10 phnM_032

$ sh a.sh
       2
       2

Answer 2

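@jas answered your question, so go with that answer; the following is just a better way to do what you are trying to do, and it is too large and needs too much formatting to fit in a comment: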

awk '
BEGIN {
    gene = "phnM"
    threshold = "5"
}
{
    genes[$5]
    contigs[$1]
}
ENDFILE {
    printf "%s was found %d times on %d contigs with minimum coverage of %d in %s\n",
        gene, length(genes), length(contigs), threshold, FILENAME
    delete genes
    delete contigs
}
' */*.cov

The above uses GNU awk for ENDFILE, but if necessary it can be adapted to work with other awks:

awk '
BEGIN {
    gene = "phnM"
    threshold = "5"
}
FNR==1 { prt() }
{
    genes[$5]
    contigs[$1]
}
END { prt() }
function prt() {
    if (fname != "") {
        printf "%s was found %d times on %d contigs with minimum coverage of %d in %s\n",
            gene, length(genes), length(contigs), threshold, fname
        delete genes
        delete contigs
    }
    fname = FILENAME
}
' */*.cov

See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons to avoid shell loops when manipulating text.
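
If gene and threshold live in shell variables, as in the question, they can be handed to awk with -v instead of being hard-coded in BEGIN. The following is only a sketch of the portable variant above: the wrapper name summarize and the counters ngenes and ncontigs are hypothetical, and explicit counters plus split("", array) are used here in place of length(array) and delete array on the assumption that some older awks lack those forms:

```shell
#!/bin/bash

gene="phnM"
threshold="5"

# Hypothetical wrapper: one awk process summarizes every file it is given.
summarize() {
    awk -v gene="$gene" -v threshold="$threshold" '
    FNR == 1 { prt() }              # a new file starts: report the previous one
    {
        if (!($5 in genes))   { genes[$5];   ngenes++ }
        if (!($1 in contigs)) { contigs[$1]; ncontigs++ }
    }
    END { prt() }                   # report the last file
    function prt() {
        if (fname != "") {
            printf "%s was found %d times on %d contigs with minimum coverage of %d in %s\n",
                gene, ngenes, ncontigs, threshold, fname
            split("", genes); split("", contigs)   # empty both arrays
            ngenes = ncontigs = 0
        }
        fname = FILENAME
    }
    ' "$@"
}

# Example call, guarded so nothing runs when no .cov file matches:
for f in */*.cov
do
    [ -e "$f" ] && { summarize */*.cov; break; }
done
```

As in the answer's version, an empty input file produces no line of output because FNR == 1 never fires for it.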

Printing the output of a user-defined function in awk gives an unexpected token error
