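I'm trying to bring together the data in two data frames (it isn't really a merge or a join), depending on whether a value falls within a range given in the second data frame.
For convenience, the data is at the end of the post. One data frame (df1) looks like this: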
Chromosome Position P.value start.range end.range name
2 4553493 8.23e-05 4453493 4653493 A
3 24548810 1.04e-04 24448810 24648810 B
1 9952003 2.09e-04 9852003 10052003 C
The second data frame is much longer, but its head (df2) looks like this:
ensembl_gene_id chromosome_name start_position end_position
OS01G0281600 1 10048273 10050309
OS01G0281400 1 10021423 10027120
OS01G0281301 1 10019633 10020376
OS01G0281200 1 10011875 10015468
OS01G0281100 1 10008075 10011595
OS01G0281000 1 10003952 10007742
For each row of df1, I need the df2 rows where df1$Position is within 100,000 of either df2$start_position or df2$end_position (i.e. ((df1$Position - df2$start_position)<100000 | (df1$Position - df2$end_position)<100000)).
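As output, I need a list or data frame of the matching rows. Several df2 rows will match each df1 row, and there are multiple entries per chromosome, but df1$name is unique. I've been trying various applications of ddply and custom functions, but I keep coming up short. Any ideas?
Data: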
df1 <- structure(list(Chromosome = c(2L, 3L, 1L), Position = c(4553493L,
24548810L, 9952003L), P.value = c(8.23e-05, 0.000104, 0.000209
), start.range = c(4453493, 24448810, 9852003), end.range = c(4653493,
24648810, 10052003), name = c("A", "B", "C")), .Names = c("Chromosome",
"Position", "P.value", "start.range", "end.range", "name"), class = "data.frame", row.names = c(NA,
3L))
df2 <- structure(list(ensembl_gene_id = c("OS01G0281600", "OS01G0281400",
"OS01G0281301", "OS01G0281200", "OS01G0281100", "OS01G0281000",
"OS01G0280500", "OS01G0280400", "OS01G0280000", "OS01G0279900",
"OS01G0279800", "OS01G0279700", "OS01G0279400", "OS01G0279300",
"OS01G0279200", "OS01G0279100", "OS01G0279000", "OS01G0278900",
"OS01G0278950", "OS02G0183000", "OS02G0182850", "OS02G0182900",
"OS02G0182700", "OS02G0182800", "OS02G0182500", "OS02G0182300",
"OS02G0181900", "OS02G0182100", "OS02G0181800", "OS02G0181400",
"OS02G0180900", "OS02G0180700", "OS02G0180500", "OS02G0180200",
"OS02G0180400", "OS02G0180100", "OS03G0640300", "OS03G0640400",
"OS03G0640000", "OS03G0640100", "OS03G0639700", "OS03G0639800",
"OS03G0639600", "OS03G0639400", "OS03G0639300", "OS03G0638900",
"OS03G0639100", "OS03G0638400", "OS03G0638800", "OS03G0638300",
"OS03G0638200"), chromosome_name = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), start_position = c(10048273L,
10021423L, 10019633L, 10011875L, 10008075L, 10003952L, 9967185L,
9962807L, 9936850L, 9928971L, 9917593L, 9913390L, 9889550L, 9887657L,
9878384L, 9874379L, 9866730L, 9859354L, 9863216L, 4639932L, 4629617L,
4630446L, 4616832L, 4625425L, 4598883L, 4594375L, 4567630L, 4573831L,
4563073L, 4551426L, 4521670L, 4497115L, 4486531L, 4460342L, 4481872L,
4455016L, 24630180L, 24638186L, 24616417L, 24621460L, 24591421L,
24596843L, 24574540L, 24564913L, 24544511L, 24487877L, 24514494L,
24466606L, 24476060L, 24454477L, 24449135L), end_position = c(10050309L,
10027120L, 10020376L, 10015468L, 10011595L, 10007742L, 9969073L,
9966715L, 9947933L, 9935981L, 9921565L, 9917318L, 9902737L, 9889123L,
9885517L, 9876678L, 9870864L, 9860677L, 9866617L, 4641686L, 4630180L,
4634616L, 4621974L, 4628750L, 4601382L, 4595386L, 4573049L, 4578257L,
4566597L, 4552860L, 4523668L, 4500124L, 4489409L, 4463571L, 4483470L,
4457715L, 24634746L, 24641449L, 24617859L, 24629502L, 24596437L,
24600376L, 24579212L, 24565726L, 24549550L, 24489307L, 24515219L,
24473558L, 24480927L, 24457481L, 24453890L)), .Names = c("ensembl_gene_id",
"chromosome_name", "start_position", "end_position"), class = "data.frame", row.names = c(NA,
-51L))
Answer 0 (score: 1)
Is this what you want?
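library(plyr)  # provides ddply() and .()
ddply(df1, .(name), function(x) {
  # Same chromosome (assumed to be required), either gene end within 100,000 of Position
  df2[df2$chromosome_name == x$Chromosome &
      (abs(x$Position - df2$start_position) < 100000 |
       abs(x$Position - df2$end_position) < 100000), ]
})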
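If you would rather not depend on plyr, one possible alternative is a base R sketch along the following lines. It is not the approach used above, just an illustration: it pairs rows on the same chromosome with merge() and then filters by absolute distance. The names pairs, near, matches and window are made up for this example, and the tiny data frames are a hand-picked subset of the question's data so that the snippet runs on its own. It also assumes that a match must be on the same chromosome, which the question implies but does not state.

```r
# Small subset of the data from the question, so the sketch runs on its own.
df1 <- data.frame(
  Chromosome = c(2L, 3L, 1L),
  Position   = c(4553493L, 24548810L, 9952003L),
  name       = c("A", "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ensembl_gene_id = c("OS01G0281600", "OS01G0280000",
                      "OS02G0181400", "OS03G0639300"),
  chromosome_name = c(1L, 1L, 2L, 3L),
  start_position  = c(10048273L, 9936850L, 4551426L, 24544511L),
  end_position    = c(10050309L, 9947933L, 4552860L, 24549550L),
  stringsAsFactors = FALSE
)

# Pair every df1 row with every df2 row on the same chromosome.
pairs <- merge(df1, df2, by.x = "Chromosome", by.y = "chromosome_name")

# Keep pairs where either end of the gene lies within `window` of Position.
window <- 100000  # threshold taken from the question
near <- abs(pairs$Position - pairs$start_position) < window |
        abs(pairs$Position - pairs$end_position) < window

matches <- pairs[near, c("name", "ensembl_gene_id", "Chromosome",
                         "Position", "start_position", "end_position")]
matches <- matches[order(matches$name, matches$start_position), ]
print(matches)
```

Note that merge() builds every same-chromosome pair before filtering, so for a very large df2 the per-name ddply version may use less memory.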