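I have two data frames of different sizes, and I'm looking for the most efficient way to match strings from one data.frame against the other and extract some related information.
Here is an example:
The two initial data.frames, a and b, and the desired result: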
a = data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
age = c(30, 24, 52, 44, 73, 44, 33, 12),
visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.frame(string = c("the red ball went over the fence",
"sorry to see that your tent fell down",
"the ball fell into the red salad",
"serious people eat peanuts on Sundays"))
desired_result = data.frame(string = b$string,
num_matches = c(2, 1, 3, 0),
avg_age = c(37, 73, 32.66667, NA),
avg_visits = c(3.5, 8, 2.66667, NA))
Here are the data.frames in a more readable format:
> a
term age visits
1 red 30 5
2 salad 24 1
3 rope 52 3
4 ball 44 2
5 tent 73 8
6 plane 44 5
7 gift 33 19
8 meat 12 3
> b
string
1 the red ball went over the fence
2 sorry to see that your tent fell down
3 the ball fell into the red salad
4 serious people eat peanuts on Sundays
> desired_result
string num_matches avg_age avg_visits
1 the red ball went over the fence 2 37.00000 3.50000
2 sorry to see that your tent fell down 1 73.00000 8.00000
3 the ball fell into the red salad 3 32.66667 2.66667
4 serious people eat peanuts on Sundays 0 NA NA
Any ideas on how to do this efficiently?
Thanks.
Answer 0 (score: 2)
You can try this in base R (no packages needed):
# For each sentence, split on spaces and look up every word in a$term
res <- t(apply(b, 1, function(x) {
  words <- unlist(strsplit(x, " "))
  # Row indices in a of the terms that appear as whole words
  r <- unlist(lapply(words, function(y) which(a$term == y)))
  c(num_matches = length(r), avg_age = mean(a$age[r]), avg_visits = mean(a$visits[r]))
}))
res <- cbind(b, res)
# string num_matches avg_age avg_visits
# 1 the red ball went over the fence 2 37.00000 3.500000
# 2 sorry to see that your tent fell down 1 73.00000 8.000000
# 3 the ball fell into the red salad 3 32.66667 2.666667
# 4 serious people eat peanuts on Sundays 0 NaN NaN
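The last row above shows NaN where the question's desired result has NA, because mean() of an empty vector is NaN. If NA is preferred, a minimal sketch of the conversion could look like this; `res_demo` and `avg_cols` are hypothetical names, and the small data frame is only a stand-in for the real result:

```r
# Hypothetical stand-in for the numeric part of the result above
res_demo <- data.frame(num_matches = c(2, 0),
                       avg_age = c(37, NaN),
                       avg_visits = c(3.5, NaN))

# Replace NaN with NA in the averaged columns only
avg_cols <- c("avg_age", "avg_visits")
res_demo[avg_cols] <- lapply(res_demo[avg_cols],
                             function(col) replace(col, is.nan(col), NA))
```

Note that is.na() is TRUE for both NA and NaN, so is.nan() is the check that tells them apart.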
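Answer 1 (score: 1)
Use data.table, with by = string to process each row. Store the match results in a list column, then aggregate from those match results.
Note that the matches column is a list column: every cell holds a list. You need to wrap the match result in .(), which is really just another list(), because data.table expects a list of columns for normal columns.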
library(data.table)
library(stringr)
a = data.table(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
age = c(30, 24, 52, 44, 73, 44, 33, 12),
visits = c(5, 1, 3, 2, 8, 5, 19, 3))
b = data.table(string = c("the red ball went over the fence",
"sorry to see that your tent fell down",
"the ball fell into the red salad",
"serious people eat peanuts on Sundays"))
b[, matches := vector("list", .N)]
b[, matches := .(list(str_detect(string, a[, term]))), by = string]
b[, num_matches := sum(unlist(matches)), by = string]
b[, avg_age := mean(a[unlist(matches), age]), by = string]
b[, avg_visits := mean(a[unlist(matches), visits]), by = string]
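For comparison, the same match-then-aggregate idea can be sketched in base R with a logical match matrix instead of a list column. This is a different technique from the data.table code above, not a translation of it; `m` and `result` are hypothetical names, and fixed substring matching is an assumption made here for simplicity:

```r
# Same example data as in the question, with terms kept as character
a <- data.frame(term = c("red", "salad", "rope", "ball", "tent", "plane", "gift", "meat"),
                age = c(30, 24, 52, 44, 73, 44, 33, 12),
                visits = c(5, 1, 3, 2, 8, 5, 19, 3),
                stringsAsFactors = FALSE)
b <- data.frame(string = c("the red ball went over the fence",
                           "sorry to see that your tent fell down",
                           "the ball fell into the red salad",
                           "serious people eat peanuts on Sundays"),
                stringsAsFactors = FALSE)

# m[i, j] is TRUE when term j occurs as a fixed substring of string i
# (sapply returns a matrix here because b has more than one row)
m <- sapply(a$term, function(term) grepl(term, b$string, fixed = TRUE))
num_matches <- rowSums(m)

# Sum the matched values per string, then divide by the match count;
# 0 / 0 yields NaN for strings without any match
result <- data.frame(string = b$string,
                     num_matches = num_matches,
                     avg_age = as.vector((m + 0) %*% a$age) / num_matches,
                     avg_visits = as.vector((m + 0) %*% a$visits) / num_matches)
```

The matrix has one row per string and one column per term, so memory grows with the product of the two sizes; for very large inputs that trade-off would need checking.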
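Answer 2 (score: 0)
I would build desired_result one column at a time.
For that you need two functions: one to count the matches and one to compute the averages.
First, the counter: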
# Count how many of the patterns occur in the sentence
counter <- function(sentence, pattern) {
  count <- 0
  for (var in pattern) {
    if (grepl(pattern = var, sentence)) count <- count + 1
  }
  return(count)
}
For the two averages, you can use the same function in both cases:
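# look_up has the terms in column 1 and the values to average in column 2
average <- function(sentence, look_up) {
  pattern <- look_up[, 1]
  count <- 0
  summed <- 0
  for (var in pattern) {
    if (grepl(pattern = var, sentence)) {
      count <- count + 1
      summed <- sum(look_up[look_up[, 1] == var, 2]) + summed
    }
  }
  # 0 / 0 gives NaN when nothing matched
  return(summed / count)
}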
This can be applied to your data as follows.
First:
desired_result <- data.frame(string = b$string)
Then compute the values:
desired_result$num_match<- sapply(b$string,counter,pattern=a$term)
desired_result$avg_age<- sapply(b$string,average,look_up=a[,c(1,2)])
desired_result$avg_visit<- sapply(b$string,average,look_up=a[,c(1,3)])
Now you get the desired_result mentioned in the question:
> desired_result
string num_match avg_age avg_visit
1 the red ball went over the fence 2 37.00000 3.500000
2 sorry to see that your tent fell down 1 73.00000 8.000000
3 the ball fell into the red salad 3 32.66667 2.666667
4 serious people eat peanuts on Sundays 0 NaN NaN
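One caveat: grepl() as used above matches substrings, so a term like "red" would also be found inside "tired". That makes no difference for the example data, but if whole-word matching is wanted, a hedged variant of the counter could look like the following. `counter_whole_word` is a hypothetical name, and the sketch assumes the terms contain no regex metacharacters:

```r
# Hypothetical whole-word variant of counter();
# assumes the terms contain no regex metacharacters
counter_whole_word <- function(sentence, pattern) {
  hits <- vapply(pattern, function(term) {
    # \b anchors the term at word boundaries on both sides
    grepl(paste0("\\b", term, "\\b"), sentence, perl = TRUE)
  }, logical(1))
  sum(hits)
}

# "red" is not counted inside "tired"; "dog" is a whole word
n_whole <- counter_whole_word("the tired dog", c("red", "dog"))
```

Splitting on spaces and comparing with ==, as in Answer 0, is another way to get whole-word behaviour.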