Question

I am trying to write a bash script that operates on two files:

file1

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs3096851   204763882   A>C Alopecia Areata 3.58E-08    1.32    802
2q33.2  2:204611195..204817281  rs1024161   204721752   G>A Alopecia Areata 3.55E-13    1.44    802
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682241   123523875   C>A Alopecia Areata 4.27E-08    1.34    803
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

file2

CHR_A         BP_A       SNP_A  CHR_B         BP_B       SNP_B           R2 
 2    204721752   rs1024161      2    204732714    rs231775     0.849535 
 2    204721752   rs1024161      2    204763882   rs3096851      0.68029 
 2    204732714    rs231775      2    204763882   rs3096851     0.739633 
 4    123523875   rs7682241      4    123524026   rs7682481            1

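I want to read file1 and, if the column 3 (RsId) value is not present in column 3 (SNP_A) or column 6 (SNP_B) of file2, write that line to the output. I tried the following: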

#this executable file is called filter_file.sh

#!/bin/bash
file1=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do 
cat $file2 | awk "(\$3!~/$SNP_A|$SNP_B/) {print}"
done < $file1 > $outfile

./filter_file.sh file1 file2 out

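The awk statement works when I test it on its own, but when I add it to the bash while loop it prints all of file1, including the header, four times. What is wrong with the code for this step?

Once this works, if the file1 column 3 (RsId) value is present in column 3 (SNP_A) or column 6 (SNP_B) of file2, I want to write to the output only the row with the lowest value in file1 column 7 (the p-value).

I am not sure how to start on the second part of this task. From reading other awk questions, I think I could try an if statement set up something like this: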

#!/bin/bash
file=$1
file2=$2
outfile=$3

while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do 
cat $file2 | awk "{
if ((\$3!~/$SNP_A|$SNP_B/))
    print $0;
else
    #Statement that prints only the row with the lowest value for column 7 
}"
done < $file1 > $outfile

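What approaches can I use for this step?

If anyone can point me to tutorials that would help with these kinds of problems, I would be very grateful.

The desired output file would look like this (order does not matter):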

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

Answer 1

There is no need to loop over the file in bash; you can do the whole thing in awk:

$ awk 'NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; next}
       FNR > 1 && !($3 in snp_a || $3 in snp_b)' file2 file1
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872

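The first block applies to the first file (file2) and sets keys in two arrays, one for each of the two columns of interest. next skips any further commands, so the rest of the script only applies to the second file (file1). A line is printed when the condition is true, i.e. when the line number within the file (FNR) is greater than 1 and the key is not found in either array.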
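To connect this back to the filter_file.sh layout in the question, here is a minimal sketch that wraps the same two-file idiom in a shell function taking file1, file2 and the output file. The function name filter_file, the temporary directory and the shortened sample rows (first four columns of rows from the question) are assumptions made for illustration. Unlike the one-liner above, this variation also keeps the header line of file1, and it assumes file2 is not empty.

```shell
#!/bin/bash
# Hypothetical wrapper: filter_file FILE1 FILE2 OUTFILE
filter_file() {
    local file1=$1 file2=$2 outfile=$3
    # file2 is passed to awk first so that its SNP IDs are known
    # before any line of file1 is examined (assumes file2 is non-empty).
    awk '
        NR == FNR {                       # still reading file2
            if (FNR > 1) { snp_a[$3]; snp_b[$6] }
            next
        }
        # file1: keep the header, plus rows whose RsId is in neither array
        FNR == 1 || !($3 in snp_a || $3 in snp_b)
    ' "$file2" "$file1" > "$outfile"
}

# Small demo using shortened rows from the question
tmp=$(mktemp -d)
printf '%s\n' \
    'Region Coords. RsId Position' \
    '1p13.2 1:113839149..114551845 rs2476601 114377568' \
    '2q33.2 2:204611195..204817281 rs231775 204732714' > "$tmp/file1"
printf '%s\n' \
    'CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2' \
    '2 204721752 rs1024161 2 204732714 rs231775 0.849535' > "$tmp/file2"

filter_file "$tmp/file1" "$tmp/file2" "$tmp/out"
cat "$tmp/out"
```

In the demo, the rs231775 row is dropped because that ID appears as SNP_B in file2, while the header and the rs2476601 row are kept.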

For the second part of your question, things get a bit more complicated... hopefully the comments explain:

$ cat script.awk
# first file, save individual columns and pairs
NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; pair[$3,$6]; next}

# second file, print first line
NR != FNR && FNR == 1

# second file, rest of lines
FNR > 1 {
    # print lines which aren't in either array
    if (!($3 in snp_a || $3 in snp_b)) {print}
    # save other lines and corresponding p-value ($8 rather than $7,
    # because the disease name "Alopecia Areata" splits into two fields)
    else {s[$3] = $0; p[$3] = $8}
}
END {
    # loop through all saved lines
    for (i in s) {
        # set initial line and min
        line = s[i]
        min = p[i]

        # locate associated lines
        if (i in snp_a) {
            for (j in snp_b) {
                # SUBSEP is a special variable used when combining keys
                # as in first block pair[$3,$6]
                if (i SUBSEP j in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        else if (i in snp_b) {
            for (j in snp_a) {
                if (j SUBSEP i in pair && j in s && p[j] < min) {
                    min = p[j]
                    line = s[j]
                }
            }
        }
        out[line]
    }
    for (i in out) print i
}
$ awk -f script.awk file2 file1
Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872
2q33.2  2:204611195..204817281  rs231775    204732714   A>G Alopecia Areata 2.20E-20    1.39    802
4q27    4:122982314..123605528  rs7682481   123524026   G>C Alopecia Areata 4.80E-09    1.23    803

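I'm fairly sure the logic in the END block could be simplified, but I couldn't think of a better way to determine the "pairs". Either way, it produces the output you want.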
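The pair lookups in the END block depend on how awk stores multi-subscript keys, so a small standalone sketch may help. It shows that pair[x, y] is stored under the single key x SUBSEP y, that the parenthesised form (x, y) in pair tests the same key, and that the order of the two IDs matters, which is why the script checks both directions. The variable names a and b are chosen only for this illustration; the two IDs come from file2.

```shell
#!/bin/bash
result=$(awk 'BEGIN {
    # Referencing the element creates it, under the key a SUBSEP b
    pair["rs1024161", "rs231775"]
    a = "rs1024161"; b = "rs231775"

    # Two equivalent ways of testing the combined key
    if ((a, b) in pair)      print "forward: yes"
    if (a SUBSEP b in pair)  print "subsep: yes"

    # The reversed order is a different key and was never stored
    if (!((b, a) in pair))   print "reverse: no"
}')
echo "$result"
```

This prints "forward: yes", "subsep: yes" and "reverse: no", one per line.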

Answer 2

You can use awk:

awk 'NR==FNR{if(NR>1)r[NR]=$3$6;next}{p=1;for(i in r){if(r[i]~$3){p=0;break}}}p' \
  file2 file1

It is easier to follow in a multi-line version with comments:

# NR (overall row number) and FNR (row number within the current input
# file) are equal only while we are processing the first input file.
# This means the following block only runs on file2, which is passed first.
NR==FNR {
    # Skip the header line (NR == 1); for every other line, store $3
    # concatenated with $6 in an array called r (rows)
    if (NR>1) r[NR]=$3$6
    # Stop processing this line and move on to the next. This makes sure
    # that the following block only operates on file1
    next
}

# The following block runs on every line of file1
{
    # Init the p (print) flag
    p=1
    # Iterate over the previously stored values
    for(i in r) {
        # ~ is a regex match, so this is true when field 3 occurs
        # anywhere inside the stored SNP_A/SNP_B string
        if(r[i]~$3){
            # A stored entry matched, so reset the p flag
            # and break out of the loop
            p=0
            break
        }
    }
}

# Print if the p flag is set (print is the default action in awk)
p

Output:

Region  Coords. RsId    Position    Alleles Disease PValue  OddsRatio   RegionID
1p13.2  1:113839149..114551845  rs2476601   114377568   G>A Alopecia Areata 8.90E-08    1.34    869
2q13    2:111444884..111809030  rs3789129   111698040   A>C Alopecia Areata 1.50E-08    0.76    871
5q31.1  5:131783213..132135372  rs848   131996500   C>A Alopecia Areata 4.80E-09    1.27    872
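One caveat with the r[i]~$3 test is that ~ performs a regular-expression match, so a short ID can match inside a longer one. The sketch below contrasts that with an exact key lookup using in. The ID rs8481 is hypothetical and is introduced only to trigger the effect; rs848 and rs231775 come from the question's data, and the variable names are assumptions for this illustration.

```shell
#!/bin/bash
result=$(awk 'BEGIN {
    # Hypothetical stored pair, built the way r[NR] = $3 $6 would build it
    stored = "rs8481" "rs231775"
    id = "rs848"

    # Regex match: rs848 is a substring of rs8481, so this matches
    msg = (stored ~ id) ? "regex: match" : "regex: no match"
    print msg

    # Exact lookup: each ID is its own array key
    seen["rs8481"]; seen["rs231775"]
    msg = (id in seen) ? "exact: match" : "exact: no match"
    print msg
}')
echo "$result"
```

This prints "regex: match" followed by "exact: no match". With the sample files in the question the two tests agree, so the difference only matters when one ID is a prefix or substring of another.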
