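Hello, I would like to do this in AWK or R:
Given the following 3 files as an example, I want to grep the whole line from test_bed_file if the third column of that file starts with a gene present in test_list_of_genes, and put the result in test_results, as highlighted below.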
Example JsFiddle: https://jsfiddle.net/e64y5wfj/4/
So far this gives me what I want, but I am not sure whether it is the right solution:
grep -Fwf test_list_of_genes test_bed_file > test_result.txt
Any help or suggestions are much appreciated.
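Answer 0: (score: 0)
This will do it in R:
test_bed_file[gsub("\\..*", "", test_bed_file[, 3]) %in% test_list_of_genes, ]
The gsub call removes everything from the first dot (.) onward, and %in% checks whether what remains is in test_list_of_genes. The matching rows of test_bed_file are then extracted.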
Answer 1: (score: 0)
test_bed_file <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1"), V2 = c(989121L, 989816L, 990192L, 1146926L,
1147072L, 1147310L, 1167647L, 1266714L, 1267006L, 1267392L, 3645879L,
3646552L, 3647479L, 3648015L, 3649299L, 5923313L), V3 = c(989367L,
989941L, 990371L, 1147015L, 1147222L, 1147528L, 1168655L, 1266926L,
1267328L, 1268196L, 3646022L, 3646722L, 3647639L, 3648130L, 3649650L,
5923475L), V4 = c("AGRN.chr1.989132.989357", "AGRN.chr1.989827.989931",
"AGRN.chr1.990203.990361", "TNFRSF4.chr1.1146938.1147005", "TNFRSF4.chr1.1147084.1147212",
"TNFRSF4.chr1.1147322.1147518", "B3GALT6.chr1.1167659.1168645",
"TAS1R3.chr1.1266726.1266916", "TAS1R3.chr1.1267018.1267318",
"TAS1R3.chr1.1267404.1268186", "TP73.chr1.3645891.3646012", "TP73.chr1.3646564.3646712",
"TP73.chr1.3647491.3647629", "TP73.chr1.3648027.3648120", "TP73.chr1.3649311.3649640",
"NPHP4.chr1.5923324.5923465")), .Names = c("V1", "V2", "V3",
"V4"), row.names = c(NA, -16L), class = c("data.table", "data.frame"
))
find <- c("AGRN", "B3GALT6", "TP73")
test_bed_file[unlist(sapply(find, function(x) grep(paste0("^", x), test_bed_file$V4))),]
This searches for each find[x] at the start of the strings in column V4 of test_bed_file and returns the matching rows:
V1 V2 V3 V4
1: chr1 989121 989367 AGRN.chr1.989132.989357
2: chr1 989816 989941 AGRN.chr1.989827.989931
3: chr1 990192 990371 AGRN.chr1.990203.990361
4: chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
5: chr1 3645879 3646022 TP73.chr1.3645891.3646012
6: chr1 3646552 3646722 TP73.chr1.3646564.3646712
7: chr1 3647479 3647639 TP73.chr1.3647491.3647629
8: chr1 3648015 3648130 TP73.chr1.3648027.3648120
9: chr1 3649299 3649650 TP73.chr1.3649311.3649640
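One caveat with a bare prefix test such as ^TP73 is that it would also accept a longer gene symbol that merely starts with the same letters. Below is a minimal sketch of a stricter check in awk, assuming the gene name is always followed by a dot in the last column. TP73X is a made-up symbol used only to show the difference, and the file names inside the temporary directory are stand-ins for the real files.

```shell
#!/bin/sh
# Sketch: "starts with gene" vs. "starts with gene followed by a dot".
# TP73X is a hypothetical gene symbol, not part of the original data.
workdir=$(mktemp -d)

printf '%s\n' AGRN TP73 > "$workdir/genes"
cat > "$workdir/bed" <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 3645879 3646022 TP73.chr1.3645891.3646012
chr1 9999000 9999100 TP73X.chr1.9999010.9999090
chr1 5923313 5923475 NPHP4.chr1.5923324.5923465
EOF

# Loose: keep a row if the last column starts with any listed gene.
loose=$(awk 'FNR==NR { genes[$0]; next }
             { for (g in genes) if (index($NF, g) == 1) { print; next } }' \
        "$workdir/genes" "$workdir/bed")

# Strict: the listed gene must be followed immediately by a dot.
strict=$(awk 'FNR==NR { genes[$0]; next }
              { for (g in genes) if (index($NF, g ".") == 1) { print; next } }' \
         "$workdir/genes" "$workdir/bed")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$workdir"
```

The loose variant keeps the TP73X row, while the strict variant drops it. In R the same tightening could be had by appending an escaped dot to the pattern.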
Answer 2: (score: 0)
Your code is wrong. What it asks is:
is AGRN.chr1.989132.989357 in array a??
You need to extract the gene from that string, like this:
awk 'FNR==NR {a[$0]; next} {match($NF, /([[:alnum:]]+)\./, arr); if (arr[1] in a) print $0 > "test_results"}' test_list_of_genes test_bed_file
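No external redirection is needed, because awk allows it inside the program (print $0 > "test_results").
More information on awk redirection: https://www.gnu.org/software/gawk/manual/html_node/Redirection.html
The match function extracts the gene into an array (arr). You then test the gene (arr[1]) against your array of genes (a).
More information on awk string functions: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html
Hope this helps.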
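Note that the three-argument form of match() is a gawk extension, so the one-liner may fail on other awk implementations such as mawk or BSD awk. Below is a minimal portable sketch that uses split() instead, assuming the gene name is everything before the first dot in the last column. The temporary directory and the shortened sample data are stand-ins for the real files.

```shell
#!/bin/sh
# Sketch: portable awk version that avoids gawk's three-argument match().
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' AGRN B3GALT6 TP73 > test_list_of_genes
cat > test_bed_file <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 1146926 1147015 TNFRSF4.chr1.1146938.1147005
chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
chr1 3645879 3646022 TP73.chr1.3645891.3646012
chr1 5923313 5923475 NPHP4.chr1.5923324.5923465
EOF

# First file: remember every gene as an array key.
# Second file: split the last column on dots and look up the first piece.
awk 'FNR==NR { genes[$0]; next }
     { split($NF, parts, "."); if (parts[1] in genes) print > "test_results" }' \
    test_list_of_genes test_bed_file

cat test_results
```

With this sample, test_results holds the AGRN, B3GALT6 and TP73 rows. Because the lookup is an exact key test, a gene that is only a prefix of a longer symbol does not match.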
Answer 3: (score: 0)
Try the following awk solution.
awk 'FNR==NR{a[$0]=$0;next} ($7 in a) || ($9 in a)' test_list_of_genes FS='[ .]' test_bed_file
EDIT: Adding the output, as requested by the OP.
awk 'FNR==NR{a[$0]=$0;next} ($7 in a) || ($9 in a)' test_list_of_genes FS='[ .]' test_bed_file
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 989816 989941 AGRN.chr1.989827.989931
chr1 990192 990371 AGRN.chr1.990203.990361
chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
chr1 3645879 3646022 TP73.chr1.3645891.3646012
chr1 3646552 3646722 TP73.chr1.3646564.3646712
chr1 3647479 3647639 TP73.chr1.3647491.3647629
chr1 3648015 3648130 TP73.chr1.3648027.3648120
chr1 3649299 3649650 TP73.chr1.3649311.3649640
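The FS='[ .]' placed between the two file names is an awk command-line variable assignment. It takes effect only when awk reaches that operand, so the gene list is read with the default separator and the BED file is then split on every space and dot. Below is a minimal sketch of that mechanism, assuming single-space-separated columns, in which case the gene name lands in $4. The $7 and $9 in the command above presumably reflect different spacing in the real file, so the field number should be checked against the actual data.

```shell
#!/bin/sh
# Sketch: FS assigned between file operands applies to the second file only.
# Assumes exactly one space between columns, so the gene name is field 4.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' AGRN B3GALT6 TP73 > test_list_of_genes
cat > test_bed_file <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 1146926 1147015 TNFRSF4.chr1.1146938.1147005
chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
chr1 3645879 3646022 TP73.chr1.3645891.3646012
chr1 5923313 5923475 NPHP4.chr1.5923324.5923465
EOF

# Gene list: whole line ($0) becomes an array key.
# BED file: split on spaces and dots, then test field 4 against the keys.
result=$(awk 'FNR==NR { genes[$0]; next } $4 in genes' \
         test_list_of_genes FS='[ .]' test_bed_file)

printf '%s\n' "$result"
```

A quick way to find the right field number for a given file is to print NF and the individual fields for its first line with the same FS setting.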