I have two data frames:
df.1 <- data.frame(loc = c('A','B','C','C'), person = c(1,2,3,4), str = c("door / window / table", "window / table / toilet / vase ", "TV / remote / phone / window", "book / vase / car / chair"))
which gives:
  loc person                            str
1   A      1          door / window / table
2   B      2 window / table / toilet / vase
3   C      3   TV / remote / phone / window
4   C      4      book / vase / car / chair
and
df.2 <- data.frame(loc = c('A','B','C'), str = c("book / chair / chair", " table / remote / vase ", "window"))
which gives:
  loc                    str
1   A   book / chair / chair
2   B table / remote / vase
3   C                 window
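I want to create a variable df.1$percentage that gives, for each row, the share of the elements of df.1$str that also appear in df.2$str for the same loc, i.e.:
  loc person                            str percentage
1   A      1          door / window / table       0.00
2   B      2 window / table / toilet / vase       0.50
3   C      3   TV / remote / phone / window       0.25
4   C      4      book / vase / car / chair       0.00
(Row 1 has 0/3 matches, row 2 has 2/4, row 3 has 1/4, and row 4 has 0/4.)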
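To make the calculation concrete, here is a minimal sketch for a single pair of strings (row 2 of df.1 against the loc B row of df.2). The helper name `words` is only an illustration, not part of any package.

```r
# Hypothetical helper: split on "/" and trim surrounding whitespace
words <- function(s) trimws(strsplit(s, "/", fixed = TRUE)[[1]])

x <- words("window / table / toilet / vase ")  # row 2 of df.1
y <- words(" table / remote / vase ")          # loc B in df.2

x %in% y                     # FALSE TRUE FALSE TRUE
sum(x %in% y) / length(x)    # 0.5, i.e. 2 of 4 words match
```

The denominator is the number of words in the df.1 string, so the same df.2 row can yield different percentages for different people at the same loc.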
Thanks!
Answer 0 (score: 4)
Someone may come up with a smarter solution, but here is a straightforward approach:
library(data.table)
dt1 = data.table(df.1)
dt2 = data.table(df.2)

# strip spaces and split each string into words
words = function(x) strsplit(gsub(" ", "", as.character(x)), "/")

# join on loc: one row per row of dt1, in dt1's order;
# dt2's string stays `str`, dt1's string becomes `i.str`
m = dt2[dt1, on = "loc"]

# share of each row's words that also appear in the matching dt2 row
dt1[, percentage := mapply(function(x, y) sum(x %in% y) / length(x),
                           words(m$i.str), words(m$str))]
dt1
#    loc person                            str percentage
# 1:   A      1          door / window / table       0.00
# 2:   B      2 window / table / toilet / vase       0.50
# 3:   C      3   TV / remote / phone / window       0.25
# 4:   C      4      book / vase / car / chair       0.00
Answer 1 (score: 4)
As you may know, data.frame columns can also hold lists (see Create a data.frame where a column is a list). So you can split str into lists of words:
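# trim stray outer spaces, then split on "/" with optional surrounding spaces
df.1 <- transform(df.1, words.1 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))
df.2 <- transform(df.2, words.2 = I(strsplit(trimws(as.character(str)), "\\s*/\\s*")))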
Then merge your data:
m <- merge(df.1, df.2, by = "loc")
Then simply use mapply to compute the percentage:
transform(m, percentage = mapply(function(x, y) sum(x %in% y) / length(x),
                                 words.1, words.2))
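One caveat with the sample data: some strings carry stray leading or trailing spaces (for example " table / remote / vase "), so splitting on the literal " / " can leave words such as " table" that never equal "table". The sketch below uses only base R to compare the two ways of splitting; the variable names are illustrative.

```r
s1 <- "window / table / toilet / vase "   # df.1, loc B
s2 <- " table / remote / vase "           # df.2, loc B

# Splitting on the literal " / " keeps the outer spaces
raw1 <- strsplit(s1, " / ", fixed = TRUE)[[1]]
raw2 <- strsplit(s2, " / ", fixed = TRUE)[[1]]
sum(raw1 %in% raw2) / length(raw1)        # 0.25: only "vase " matches

# Trimming first, then splitting on "/" with optional spaces around it
clean1 <- strsplit(trimws(s1), "\\s*/\\s*")[[1]]
clean2 <- strsplit(trimws(s2), "\\s*/\\s*")[[1]]
sum(clean1 %in% clean2) / length(clean1)  # 0.5
```

With untrimmed words the B row would come out as 0.25 rather than the expected 0.50, so it is worth normalising whitespace before comparing.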
Answer 2 (score: 2)
Another way:
# pair each row of df.1 with the df.2 row that has the same loc
test <- data.frame(str1 = df.1$str, str2 = df.2$str[match(df.1$loc, df.2$loc)])
# split on "/" and strip all spaces
getwords <- function(x) { gsub(" ", "", unlist(strsplit(as.character(x), "/"))) }
# share of the words in x that also occur in y (exact match)
percent <- function(x, y) {
  sum(getwords(x) %in% getwords(y)) / length(getwords(x))
}
df.1$percent <- apply(test, 1, function(x) percent(x[1], x[2]))
> df.1
  loc person str percent
# A 1 door / window / table 0.00
# B 2 window / table / toilet / vase 0.50
# C 3 TV / remote / phone / window 0.25
# C 4 book / vase / car / chair 0.00
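A side note on matching: grep() looks for a pattern anywhere inside each string, so it can count partial matches, whereas %in% compares whole words. A small illustration with made-up words (not taken from the data in the question):

```r
y <- c("tablecloth", "remote")

grep("table", y)      # 1: "table" is found inside "tablecloth"
"table" %in% y        # FALSE: no element is exactly "table"

# grep() can be made exact by anchoring the pattern
grep("^table$", y)    # integer(0)
```

For this problem, where each item is meant to be a whole word, exact comparison is probably the safer default.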