I have two data frames. The first is saved in an object named b:
structure(list(CONTENT = c("@myntra beautiful teamä»ç where is the winners list?",
"The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good",
"I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!",
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X https://t.co/fOpBRWCdSh",
"Check out #myPUMAcollection on @Myntra. Its perfect for a day at gym. https://t.co/VeRy4G3c7X.....",
"@DrDrupad @myntra #myPUMAcollection superb :)", "Super exclusive collection @myntra #myPUMAcollection https://t.co/Qm9dZzJdms",
"@myntra gave my best Love playing wid u Hope to win #myPUMAcollection",
"Check out PUMA Unisex Black Running Performance Gloves on Myntra! https://t.co/YD6IcvuG98 @myntra #myPUMAcollection",
"@myntra i have been mailing my issue daily since past week.All i get in reply is an auto generated assurance mail. 1st time pissed wd myntra"
), score = c(7.129, 7.08, 6.676, 5.572, 5.572, 5.535, 5.424,
5.205, 4.464, 4.245)), .Names = c("CONTENT", "score"), row.names = c(25L,
103L, 95L, 66L, 90L, 75L, 107L, 32L, 184L, 2L), class = "data.frame")
The second data frame is saved in an object named c:
structure(list(CONTENT = c("The best ever for workout over to myntra like if you find it good",
"i finalised buy a top myntra and found the at in feel like i so in life"
)), .Names = "CONTENT", row.names = c(103L, 95L), class = "data.frame")
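For each statement in the second data frame (c), I want to find the closest match in the first data frame (b) and return the corresponding score from the first data frame (b).
For example, the statement "The best ever for workout over to myntra like if you find it good" closely matches the second statement in data frame b, so I should get back the score 7.080.
I tried adapting some code from Stack Overflow: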
cp <- str_split(c$CONTENT, " ")
library(data.table)
nn <- lengths(cp) ## Or, for < R-3.2.0, `nn <- sapply(wordList, length)`
dt <- data.table(grp=rep(seq_along(nn), times=nn), X = unlist(cp), key="grp")
dt[,Score:=b$score[pmatch(X,b$CONTENT)]]
dt[!is.na(Score), list(avgScore=sum(Score)), by="grp"]
This returns a value for only one of the statements in data frame c. Can anyone help?
Answer 0 (score: 1)
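Here is one approach using stringsim from the stringdist package. There are several methods (algorithms) to choose from; I decided to use the Jaro distance metric to compute similarity because it seems to produce reasonable results for your data. That said, my experience with this subject is casual at best, so you may want to spend some time reading about and experimenting with the various algorithms that stringdist provides.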
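To get a feel for how the choice of method changes the similarity values, a small sketch like the following may help. The toy strings and the particular methods compared are illustrative assumptions, not part of the original data. Note that in stringdist, method = "jw" with the default p = 0 gives the plain Jaro measure.

```r
library(stringdist)

# Hypothetical toy strings, chosen only for illustration
target     <- "check out the puma collection"
candidates <- c("Check out #myPUMAcollection on @Myntra",
                "I finalised on buy a top from Myntra")

# Similarity of each candidate to the target under a few methods;
# stringsim() returns values in [0, 1], where 1 means identical strings
methods <- c("jw", "lv", "cosine")
sims <- sapply(methods, function(m) stringsim(candidates, target, method = m))
rownames(sims) <- c("candidate_1", "candidate_2")
sims
```

Each column holds the similarities under one method, so it is easy to see whether the methods agree on which candidate is closest.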
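To reduce clutter, I use this wrapper function, which returns the index of the element most similar (highest similarity value) to a given string,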
library(stringdist)
library(data.table)
best_match <- function(x, y, method = "jw", ...) {
which.max(stringsim(x, y, method, ...))
}
and build a data.table containing the strings to match, adding a dummy index for row-wise operations:
Dt <- data.table(
MatchPhrase = df_c$CONTENT,
Idx = 1:nrow(df_c)
)
Using best_match, add a column containing the index of the best match (and drop the dummy Idx column afterwards),
Dt[, MatchIdx := best_match(df_b$CONTENT, MatchPhrase),
by = "Idx"][,Idx := NULL]
and extract the corresponding elements from df_b (I renamed your data from b and c to df_b and df_c, respectively):
Dt[, .(Score = df_b$score[MatchIdx],
BestMatch = df_b$CONTENT[MatchIdx]),
by = "MatchPhrase"]
# MatchPhrase Score
#1: The best ever for workout over to myntra like if you find it good 7.080
#2: i finalised buy a top myntra and found the at in feel like i so in life 6.676
# BestMatch
#1: The best ever Puma wishlist for Workout freaks, Head over to @myntra https://t.co/V58Gk3EblW #MyPUMACollection Hit Like if you Find it good
#2: I finalised on buy a top from Myntra, and then I found the same top at 20% off in jabong. I feel like I've achieved so much in life!
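For reference, the same idea can be condensed into a self-contained sketch that does not need a dummy index. The toy data frames below are hypothetical stand-ins for df_b and df_c, and using sapply() instead of data.table is a simplification of mine rather than the approach shown above.

```r
library(stringdist)

# Hypothetical stand-ins for df_b (scored text) and df_c (phrases to match)
df_b <- data.frame(
  CONTENT = c("great running shoes", "late delivery again", "lovely summer dress"),
  score   = c(7.1, 4.2, 6.6),
  stringsAsFactors = FALSE
)
df_c <- data.frame(
  CONTENT = c("great runing shoes", "lovly summer dress"),
  stringsAsFactors = FALSE
)

# For each phrase in df_c, find the index of the most similar row of df_b
# (method = "jw" with the default p = 0 is the Jaro measure)
match_idx <- sapply(df_c$CONTENT, function(p) {
  which.max(stringsim(df_b$CONTENT, p, method = "jw"))
}, USE.NAMES = FALSE)

# Look up the score and matched text for each phrase
result <- data.frame(
  MatchPhrase = df_c$CONTENT,
  Score       = df_b$score[match_idx],
  BestMatch   = df_b$CONTENT[match_idx],
  stringsAsFactors = FALSE
)
result
```

Keep in mind that which.max always returns some index, even when the best similarity is low, so it may be worth also inspecting the similarity value before trusting a match.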