Select rows from a text file whose IDs are listed as part of a column in another file

Asked: 2017-08-29 09:05:16

Tags: r shell awk

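Hello, I would like to do this in AWK or R:

Given the following three files as an example, I want to grep the whole line of test_bed_file if the third column of that file starts with a gene present in test_list_of_genes, and put the result in test_results, as highlighted below: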

Example JsFiddle: https://jsfiddle.net/e64y5wfj/4/

So far this gives me what I want, but I am not sure whether it is the correct solution:

grep -Fwf test_list_of_genes test_bed_file > test_result.txt

Any help or advice is much appreciated.

4 answers:

Answer 0 (score: 0)

This can be done in R:

test_bed_file[gsub("\\..*", "", test_bed_file[, 3]) %in% test_list_of_genes,]

This removes everything from the first "." onward and checks whether what remains is in test_list_of_genes. It then extracts the matching rows of test_bed_file.
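For comparison, here is a minimal sketch of the same idea in awk: strip the ID down to the part before the first "." and test it against the gene list. The file names and contents are made-up sample data (not the asker's files), and it assumes the ID is the last whitespace-separated column.

```shell
#!/bin/sh
# Sketch: keep rows whose last field, up to its first ".", is a listed gene.
# genes.txt and regions.bed are hypothetical sample files.
tmpdir=$(mktemp -d)
printf '%s\n' AGRN TP73 > "$tmpdir/genes.txt"
cat > "$tmpdir/regions.bed" <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 1146926 1147015 TNFRSF4.chr1.1146938.1147005
chr1 3645879 3646022 TP73.chr1.3645891.3646012
EOF

# First file: remember each gene as an array key.
# Second file: split the last field on "." and test the first piece.
result=$(awk 'NR == FNR { genes[$1]; next }
              { split($NF, parts, "[.]"); if (parts[1] in genes) print }' \
         "$tmpdir/genes.txt" "$tmpdir/regions.bed")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the comparison is an exact lookup on the extracted name, a gene in the list cannot match a longer name that merely starts with it.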

Answer 1 (score: 0)

Your data

test_bed_file <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1"), V2 = c(989121L, 989816L, 990192L, 1146926L, 
1147072L, 1147310L, 1167647L, 1266714L, 1267006L, 1267392L, 3645879L, 
3646552L, 3647479L, 3648015L, 3649299L, 5923313L), V3 = c(989367L, 
989941L, 990371L, 1147015L, 1147222L, 1147528L, 1168655L, 1266926L, 
1267328L, 1268196L, 3646022L, 3646722L, 3647639L, 3648130L, 3649650L, 
5923475L), V4 = c("AGRN.chr1.989132.989357", "AGRN.chr1.989827.989931", 
"AGRN.chr1.990203.990361", "TNFRSF4.chr1.1146938.1147005", "TNFRSF4.chr1.1147084.1147212", 
"TNFRSF4.chr1.1147322.1147518", "B3GALT6.chr1.1167659.1168645", 
"TAS1R3.chr1.1266726.1266916", "TAS1R3.chr1.1267018.1267318", 
"TAS1R3.chr1.1267404.1268186", "TP73.chr1.3645891.3646012", "TP73.chr1.3646564.3646712", 
"TP73.chr1.3647491.3647629", "TP73.chr1.3648027.3648120", "TP73.chr1.3649311.3649640", 
"NPHP4.chr1.5923324.5923465")), .Names = c("V1", "V2", "V3", 
"V4"), row.names = c(NA, -16L), class = c("data.table", "data.frame"
))

find <- c("AGRN", "B3GALT6", "TP73")

Regex solution

test_bed_file[unlist(sapply(find, function(x) grep(paste0("^", x), test_bed_file$V4))),]

This searches for find[x] at the start of the strings in column V4 of test_bed_file.
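One thing to keep in mind with a start-of-string search is that it accepts any ID that begins with the gene name, including a longer gene name that shares the prefix. The awk sketch below illustrates the difference between a bare prefix test and one that also requires the "." separator. "TP73X" is a made-up name used only to show the effect.

```shell
#!/bin/sh
# Sketch of the prefix pitfall; "TP73X" is a hypothetical gene name.
names='TP73.chr1.3645891.3646012
TP73X.chr1.1000.2000
AGRN.chr1.989132.989357'

# Anchored prefix only: also accepts TP73X.
loose=$(printf '%s\n' "$names" | awk -v gene=TP73 '$1 ~ ("^" gene)')

# Prefix followed by the "." separator: accepts TP73 only.
strict=$(printf '%s\n' "$names" | awk -v gene=TP73 'index($1, gene ".") == 1')

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

With the sample data in this answer the two tests agree, so this only matters if the gene list contains names that are prefixes of other names.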

Output

         V1      V2      V3                           V4
1: chr1  989121  989367      AGRN.chr1.989132.989357
2: chr1  989816  989941      AGRN.chr1.989827.989931
3: chr1  990192  990371      AGRN.chr1.990203.990361
4: chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
5: chr1 3645879 3646022    TP73.chr1.3645891.3646012
6: chr1 3646552 3646722    TP73.chr1.3646564.3646712
7: chr1 3647479 3647639    TP73.chr1.3647491.3647629
8: chr1 3648015 3648130    TP73.chr1.3648027.3648120
9: chr1 3649299 3649650    TP73.chr1.3649311.3649640

Answer 2 (score: 0)

Your code is wrong. You are asking:

is AGRN.chr1.989132.989357 in array a??

You need to extract the gene from that string, like this:

awk 'FNR==NR {a[$0]; next} {match($NF, /([[:alnum:]]+)\./, arr); if (arr[1] in a) print $0 > "test_results"}' test_list_of_genes test_bed_file

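No external redirection is needed, because awk allows redirection inside the program (print $0 > "test_results").

More about Awk redirection: https://www.gnu.org/software/gawk/manual/html_node/Redirection.html

The match function extracts the gene into an array (arr); the array argument to match is a gawk extension. You then test the gene (arr[1]) against your gene array (a).

More about Awk string functions: https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html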
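If gawk is not available, a roughly equivalent sketch can use the two-argument match together with RSTART and RLENGTH, which are standard awk. The sample files below are made up for illustration, and the output path is passed in through a variable named out, which is an assumption of this sketch rather than part of the answer.

```shell
#!/bin/sh
# Sketch: portable variant of the gene extraction, using RSTART/RLENGTH
# instead of the three-argument match(). Sample files are hypothetical.
tmpdir=$(mktemp -d)
printf '%s\n' AGRN B3GALT6 > "$tmpdir/test_list_of_genes"
cat > "$tmpdir/test_bed_file" <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 1146926 1147015 TNFRSF4.chr1.1146938.1147005
chr1 1167647 1168655 B3GALT6.chr1.1167659.1168645
EOF

awk -v out="$tmpdir/test_results" '
    FNR == NR { a[$0]; next }                      # first file: gene list
    match($NF, /^[A-Za-z0-9]+\./) {
        gene = substr($NF, RSTART, RLENGTH - 1)    # drop the trailing "."
        if (gene in a) print $0 > out              # redirection inside awk
    }' "$tmpdir/test_list_of_genes" "$tmpdir/test_bed_file"

result=$(cat "$tmpdir/test_results")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The regex here is anchored at the start of the field, so only the leading name before the first "." is considered.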

Hope this helps.

Answer 3 (score: 0)

Try the following awk solution.

awk 'FNR==NR{a[$0]=$0;next} ($7 in a) || ($9 in a)' test_list_of_genes FS='[ .]' test_bed_file

Edit: added the output at the OP's request.

awk 'FNR==NR{a[$0]=$0;next} ($7 in a) || ($9 in a)' test_list_of_genes FS='[ .]' test_bed_file
chr1    989121  989367  AGRN.chr1.989132.989357
chr1    989816  989941  AGRN.chr1.989827.989931
chr1    990192  990371  AGRN.chr1.990203.990361
chr1    1167647 1168655 B3GALT6.chr1.1167659.1168645
chr1    3645879 3646022 TP73.chr1.3645891.3646012
chr1    3646552 3646722 TP73.chr1.3646564.3646712
chr1    3647479 3647639 TP73.chr1.3647491.3647629
chr1    3648015 3648130 TP73.chr1.3648027.3648120
chr1    3649299 3649650 TP73.chr1.3649311.3649640
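The command relies on awk applying a variable assignment placed between file operands only to the files that follow it, so the gene list is read with the default FS and the bed file with the custom one. The sketch below shows that mechanism on made-up sample files. It assumes single spaces between columns and uses FS='[ .]+' so that runs of separators collapse and the gene name lands in field 4. With the single-character class '[ .]' every space counts as its own separator, which is presumably why the command above tests $7 and $9.

```shell
#!/bin/sh
# Sketch: an assignment between file operands affects only later files.
# genes.txt and regions.bed are hypothetical, single-space-separated samples.
tmpdir=$(mktemp -d)
printf '%s\n' AGRN TP73 > "$tmpdir/genes.txt"
cat > "$tmpdir/regions.bed" <<'EOF'
chr1 989121 989367 AGRN.chr1.989132.989357
chr1 1266714 1266926 TAS1R3.chr1.1266726.1266916
chr1 3645879 3646022 TP73.chr1.3645891.3646012
EOF

# genes.txt is read with the default FS; regions.bed is split on runs of
# spaces or dots, so the gene name becomes field 4. $0 is printed unchanged.
result=$(awk 'FNR == NR { a[$0]; next } ($4 in a)' \
         "$tmpdir/genes.txt" FS='[ .]+' "$tmpdir/regions.bed")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Which field number holds the gene depends on the exact whitespace in the real file, so it is worth checking with a quick print of NF and the fields before relying on a fixed index.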