I'm trying to write a bash script that operates on two files:
file1
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
2q33.2 2:204611195..204817281 rs3096851 204763882 A>C Alopecia Areata 3.58E-08 1.32 802
2q33.2 2:204611195..204817281 rs1024161 204721752 G>A Alopecia Areata 3.55E-13 1.44 802
2q33.2 2:204611195..204817281 rs231775 204732714 A>G Alopecia Areata 2.20E-20 1.39 802
4q27 4:122982314..123605528 rs7682241 123523875 C>A Alopecia Areata 4.27E-08 1.34 803
4q27 4:122982314..123605528 rs7682481 123524026 G>C Alopecia Areata 4.80E-09 1.23 803
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
file2
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
2 204721752 rs1024161 2 204732714 rs231775 0.849535
2 204721752 rs1024161 2 204763882 rs3096851 0.68029
2 204732714 rs231775 2 204763882 rs3096851 0.739633
4 123523875 rs7682241 4 123524026 rs7682481 1
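I want to read file1 and, if its column 3 (RsId) value does not appear in column 3 (SNP_A) or column 6 (SNP_B) of file2, write that line to the output. I tried the following: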
#!/bin/bash
#this executable file is called filter_file.sh
file1=$1
file2=$2
outfile=$3
while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "(\$3!~/$SNP_A|$SNP_B/) {print}"
done < $file1 > $outfile
./filter_file.sh file1 file2 out
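The awk statement works when I test it on its own, but once I put it inside the bash while loop it prints all of file1, header included, four times. What is wrong with the code for this step?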
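Once that works: if the file1 column 3 (RsId) value does exist in column 3 (SNP_A) or column 6 (SNP_B) of file2, I want to write to the output only the line with the lowest value in file1's column 7 (the p-value).
I don't know how to start on this second part of the task. From reading other awk questions, I think I could try an if statement set up something like this: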
#!/bin/bash
file1=$1
file2=$2
outfile=$3
while read CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2; do
cat $file2 | awk "{
if ((\$3!~/$SNP_A|$SNP_B/))
print $0;
else
#Statement that prints only the row with the lowest value for column 7
}"
done < $file1 > $outfile
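What approaches could I use for this step?
If anyone can point me to tutorials that would help with this kind of problem, I would be very grateful.
The desired output file would look like this (order doesn't matter):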
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
2q33.2 2:204611195..204817281 rs231775 204732714 A>G Alopecia Areata 2.20E-20 1.39 802
4q27 4:122982314..123605528 rs7682481 123524026 G>C Alopecia Areata 4.80E-09 1.23 803
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
Answer 0 (score: 3)
There's no need to loop over the files with bash; you can do the whole thing in awk:
$ awk 'NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; next}
FNR > 1 && !($3 in snp_a || $3 in snp_b)' file2 file1
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
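The first block applies to the first file (file2) and sets keys in two arrays, one for each of the two columns of interest. The next statement skips any further commands, so the rest of the script applies only to the second file (file1). A line is printed when the condition is true, that is, when the line number within the file (FNR) is greater than 1 and the key is found in neither array.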
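The NR == FNR idiom described above can be seen in isolation with two tiny files. This is only a sketch: the file names lookup.txt and data.txt and their contents are made up for illustration and are not part of the question's data.

```shell
#!/bin/bash
# Hypothetical demo files (names and contents are assumptions).
tmpdir=$(mktemp -d)
printf 'SNP\nrs1\nrs2\n' > "$tmpdir/lookup.txt"
printf 'RsId\nrs1\nrs3\n' > "$tmpdir/data.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
counters=$(awk '{ print NR, FNR, $1 }' "$tmpdir/lookup.txt" "$tmpdir/data.txt")
echo "$counters"

# Same two-pass pattern as in the answer, applied to the toy files.
filtered=$(awk '
    NR == FNR { if (FNR > 1) seen[$1]; next }   # first file: remember keys
    FNR > 1 && !($1 in seen)                    # second file: print unseen keys
' "$tmpdir/lookup.txt" "$tmpdir/data.txt")
echo "$filtered"

rm -rf "$tmpdir"
```

The first command prints lines "1 1 SNP" through "6 3 rs3", with FNR dropping back to 1 at the RsId line; the second prints only rs3. One caveat worth keeping in mind: if the first file is empty, NR == FNR is also true while reading the second file, so the idiom silently misbehaves.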
For the second part of your question, things get a bit more complicated... hopefully the comments explain it:
$ cat script.awk
# first file, save individual columns and pairs
NR == FNR && NR > 1 {snp_a[$3]; snp_b[$6]; pair[$3,$6]; next}
# second file, print first line
NR != FNR && FNR == 1
# second file, rest of lines
FNR > 1 {
# print lines which aren't in either array
if (!($3 in snp_a || $3 in snp_b)) {print}
# save other lines and corresponding p-value ($8 rather than $7, because
# the two-word disease name "Alopecia Areata" occupies fields 6 and 7)
else {s[$3] = $0; p[$3] = $8}
}
END {
# loop through all lines
for (i in s) {
# empty the array f
split("", f)
# set initial line and min
line = s[i]
min = p[i]
# locate associated lines
if (i in snp_a) {
for (j in snp_b) {
# SUBSEP is a special variable used when combining keys
# as in first block pair[$3,$6]
if (i SUBSEP j in pair && j in s && p[j] < min) {
min = p[j]
line = s[j]
}
}
}
else if (i in snp_b) {
for (j in snp_a) {
if (j SUBSEP i in pair && j in s && p[j] < min) {
min = p[j]
line = s[j]
}
}
}
out[line]
}
for (i in out) print i
}
$ awk -f script.awk file2 file1
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
2q33.2 2:204611195..204817281 rs231775 204732714 A>G Alopecia Areata 2.20E-20 1.39 802
4q27 4:122982314..123605528 rs7682481 123524026 G>C Alopecia Areata 4.80E-09 1.23 803
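I'm pretty sure the logic in the END block could be simplified, but I couldn't think of a better way to determine the "pairs". Either way, it produces the output you want.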
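One possible simplification, offered only as a sketch and not as the answer's method: because every pair is stored under a combined key, the END block could walk the pair array once and recover both halves with split(key, parts, SUBSEP), instead of nesting loops over snp_a and snp_b. The toy input format below ("P snp pvalue" and "L snp_a snp_b" lines) and all names in it are hypothetical. Note also that this version compares each pair in both directions, which is slightly different from the if / else if branches in the script above.

```shell
#!/bin/bash
# Toy input (an assumption for illustration): "P" lines give a SNP and its
# p-value, "L" lines give a linked pair of SNPs.
result=$(printf '%s\n' 'P rs1 0.05' 'P rs2 0.001' 'P rs3 0.2' 'L rs1 rs2' |
awk '
    $1 == "P" { p[$2] = $3 + 0; best[$2] = $2 }   # each SNP starts as its own best
    $1 == "L" { pair[$2, $3] }                    # key is $2 SUBSEP $3
    END {
        for (k in pair) {
            split(k, part, SUBSEP)                # part[1] = first SNP, part[2] = second
            a = part[1]; b = part[2]
            if (!(a in p) || !(b in p)) continue  # ignore pairs with an unknown SNP
            if (p[b] < p[best[a]]) best[a] = b    # partner has a lower p-value
            if (p[a] < p[best[b]]) best[b] = a
        }
        for (snp in best) keep[best[snp]]         # deduplicate the winners
        for (snp in keep) print snp, p[snp]
    }' | sort)
echo "$result"
```

Here rs1 is replaced by its lower-p partner rs2, while rs3 has no partner and is kept, so the script prints "rs2 0.001" and "rs3 0.2". The sort is only there to make the order predictable, since awk's for (k in array) order is unspecified.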
Answer 1 (score: 0)
You can use awk:
awk 'NR==FNR{if(NR>1)r[NR]=$3$6;next}{p=1;for(i in r){if(r[i]~$3){p=0;break}}}p' \
file2 file1
It is easier to understand in a multi-line version with comments:
# NR (overall row number) and FNR (row number within the current input
# file) are equal only while we are processing the first input file.
# So the following block runs only on file2, which is passed first.
NR==FNR {
# NR > 1 makes sure that we skip the header line of file2.
# Store $3 concatenated with $6 in an array called r (rows)
if (NR>1) r[NR]=$3$6
# Stop processing this line and move on to the next one. This makes sure
# that the following block only ever operates on file1
next
}
# The following block runs on every line of file1
{
# Init p flag
p=1
# Iterate over the previously stored values
for(i in r) {
# If the stored entry matches field 3, the line must not be printed.
# Note that ~ is a regex match, so this is a substring test.
if(r[i]~$3){
# Reset the p flag and break out of the loop
p=0
break
}
}
}
# Print if the p flag is set (print is the default action in awk)
p
Output:
Region Coords. RsId Position Alleles Disease PValue OddsRatio RegionID
1p13.2 1:113839149..114551845 rs2476601 114377568 G>A Alopecia Areata 8.90E-08 1.34 869
2q13 2:111444884..111809030 rs3789129 111698040 A>C Alopecia Areata 1.50E-08 0.76 871
5q31.1 5:131783213..132135372 rs848 131996500 C>A Alopecia Areata 4.80E-09 1.27 872
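One thing to watch with the ~ operator used in this answer: it treats the right-hand side as a regular expression and matches anywhere in the string, so a short ID can match inside a longer one. The following sketch contrasts it with an exact array lookup; the IDs rs8481 and rs999 are invented for the demonstration and do not come from the question's files.

```shell
#!/bin/bash
compare=$(awk 'BEGIN {
    stored = "rs8481" "rs999"   # hypothetical SNP_A and SNP_B, concatenated
    id = "rs848"                # the ID we are looking for

    # Regex match: succeeds because "rs848" occurs inside "rs8481rs999".
    r = (stored ~ id) ? "regex: match" : "regex: no match"
    print r

    # Exact lookup: each ID is its own array key, so only whole IDs match.
    seen["rs8481"]; seen["rs999"]
    e = (id in seen) ? "exact: match" : "exact: no match"
    print e
}')
echo "$compare"
```

This prints "regex: match" followed by "exact: no match". On the sample data in the question the two approaches give the same result, but on a larger SNP list the exact lookup is the safer choice.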