I have two vectors. For each element of vector A, I want to know all the elements of vector B that satisfy some condition. So, for example, two data frames containing the vectors:
person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30))
For this example, each person in the person data frame has some tickets from carnival games, and each prize in the prize data frame has a cost. But I'm not looking for exact matches: instead of simply buying prizes, people randomly receive any prize whose cost is within a tolerance of 5 tickets (e.g., Albert, with 20 tickets, can win any prize costing 15 to 25).
The output I'm looking for is a data frame of all the possible prizes each person could win. It would look something like this:
person prize
1 Albert stickyhand
2 Albert moodring
3 Albert figurine
4 Albert whistle
5 Becca moodring
6 Becca figurine
... ...
And so on. Right now I'm doing this with lapply(), but that's really no faster than a for() loop in R:
library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets - cost) <= tolerance)$type)
  longlist <- data.frame("person" = rep(names(matchlist),
                                        times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist)))
  return(longlist)
}
matching_Function(person, prize)
My actual datasets are much larger (hundreds of thousands of rows), and my matching condition is more complicated (checking whether coordinates in B fall within a given radius of coordinates in A), so this takes forever (several hours).

Is there a smarter way than for() and lapply() to go about this?
Answer 0 (score: 3)
An alternative with foverlaps from data.table that does what you want:
require(data.table)
# Turn the data frames into data.tables
setDT(person)
setDT(prize)
# Add the min and max of the range from the tolerance
# (tolerance was left undefined in the original snippet; set it here)
tolerance <- 5
person[,`:=`(start=tickets-tolerance,end=tickets+tolerance)]
# Add a dummy column so cost can be used as a zero-width range
prize[,dummy:=cost]
# Key the person table on start and end
setkey(person,start,end)
# Use foverlaps to get the corresponding rows from person for each prize,
# filter out the NA (non-matching) results and return only name and prize type
r<-foverlaps(prize,person,type="within",by.x=c("cost","dummy"))[!is.na(name),list(name=name,prize=type)]
# Reorder the result by name instead of by prize cost
setorder(r,name)
I hope the commented code is self-explanatory enough.

Output:
name prize
1: Albert stickyhand
2: Albert moodring
3: Albert figurine
4: Albert whistle
5: Becca moodring
6: Becca figurine
7: Becca whistle
8: Celine lollipop
9: Celine yo-yo
10: Celine stickyhand
11: Celine moodring
12: Dagwood yo-yo
13: Dagwood stickyhand
14: Dagwood moodring
For the second part of your question, with coordinates and a test against a radius:

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"),
                         x = c(26, 16, 32, 51),
                         y = c(92, 51, 25, 4)),
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17),
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)),
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")
setDT(person)
setDT(antenas)
r<-10
results <- person[,{dx=x-antenas$x;dy=y-antenas$y; list(antena=antenas$name[dx^2+dy^2<=r^2])},by=name]
data.table allows expressions in j, so we can do the math of an outer join of the antennas against each person and return only the relevant rows with the antenna names.

This shouldn't be memory-expensive, since it is done per person rather than on the whole cross product.
The math was inspired by this question.
This gives:

     name antena
1:  Becca      L
2: Celine      G
3: Celine      H
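If even this per-person scan gets slow at your scale, one possible refinement (not part of the original answer; it assumes data.table >= 1.9.8 for non-equi joins, and the new column names below are illustrative) is to pre-filter candidate pairs with a bounding-box join and apply the exact circle test only to the survivors:

# Hypothetical refinement: a bounding-box non-equi join as a cheap prefilter
antenas[, `:=`(x_lo = x - r, x_hi = x + r, y_lo = y - r, y_hi = y + r)]
cand <- antenas[person,
                on = .(x_lo <= x, x_hi >= x, y_lo <= y, y_hi >= y),
                nomatch = 0L,
                .(name = i.name, antena = name,
                  dx = x - i.x, dy = y - i.y)]
# Exact Euclidean test, now run on a much smaller candidate set
results2 <- cand[dx^2 + dy^2 <= r^2, .(name, antena)]

The bounding box can never exclude a true match (the circle is contained in the box), so results2 should agree with results above.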
Answer 1 (score: 1)
This is quite simple with your test data and a full outer join:
library(data.table)
setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]
merge(person,prize, by = "JA", allow.cartesian = TRUE)[abs(tickets - cost) < 6, .(name, type)]
# name type
# 1: Albert stickyhand
# 2: Albert moodring
# 3: Albert figurine
# 4: Albert whistle
# 5: Becca moodring
# 6: Becca figurine
# 7: Becca whistle
# 8: Celine lollipop
# 9: Celine yo-yo
# 10: Celine stickyhand
# 11: Celine moodring
# 12: Dagwood yo-yo
# 13: Dagwood stickyhand
# 14: Dagwood moodring
What we're doing is a full outer join, then excluding any rows that don't meet the criterion.

However, if this is a 100,000-by-100,000 full outer join, you may run out of memory with this approach. In that case I would parallelize:
library(data.table)
library(foreach)
library(doParallel)
setDT(person)
setDT(prize)
person[, JA := 1]
prize[, JA := 1]
seq_s <- seq(1,nrow(person), by = 500) #change the 500 here based on memory/speed tradeoff
ln_s <- length(seq_s)
str_seq <- paste0(seq_s,":",c(seq_s[2:ln_s],nrow(person) + 1) - 1)
cl<-makeCluster(4)
registerDoParallel(cl)
ls<-foreach(i = 1:ln_s) %dopar% {
library(data.table)
person_batch <- person[eval(parse(text = str_seq[i]))]
Output <- merge(person_batch,prize, by = "JA", allow.cartesian = TRUE)
Output <- Output[abs(tickets - cost) < 6, .(name, type)]
}
stopCluster(cl)
Output <- unique(do.call(rbind,ls))
This is essentially the same process, just split into smaller batches. It won't hit the memory limit, because we filter each batch right after it is produced.
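As a side note, newer versions of data.table (1.9.8 and later) support non-equi joins, which can express the tolerance condition directly without ever materializing the cartesian product. A minimal sketch of that idea, assuming the same person and prize tables (this is not from the answer above):

library(data.table)
setDT(person)
setDT(prize)
# Turn the +/- 5 tolerance into an explicit range per person
person[, `:=`(lo = tickets - 5, hi = tickets + 5)]
# Non-equi join: for each prize, keep the persons whose range covers its cost
res <- person[prize, on = .(lo <= cost, hi >= cost), nomatch = 0L, .(name, type)]
setorder(res, name)

Because the condition is evaluated inside the join itself, memory use stays proportional to the number of matches rather than to nrow(person) * nrow(prize).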