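I have two data frames, each containing a column of words and a column of scores for those words. I want to run comments against these data frames and create an additive score based on which words appear in each sentence.
I want to do this over a very large number of comments, so it needs to be computationally efficient. For example, the sentence "hi, he said. why is it okay" scores .98 + .1 + .2, because the words "hi", "why" and "okay" are in the data frames. Any sentence may contain words from more than one data frame.
Can anyone help me create the column "add_score" using a process that scales well to large data frames? Thanks.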
a <- data.frame(words = c("hi","no","okay","why"), score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"), score = c(.5,.3,.2))
comment_df = data.frame(id = c("1","2","3"), comments = c("hi, he said. why is it okay","okay okay okay no","yes, here is it"))
# Desired result
comment_df$add_score = c(1.28,1.1,.5)
Answer 0 (score: 2)
This solution uses functions from tidyverse and stringr.
# Load packages
library(tidyverse)
library(stringr)
# Merge a and b to create score_df
score_df <- bind_rows(a, b)
# Create a function to calculate the score for one string
string_cal <- function(string, score_df){
  temp <- score_df %>%
    # Count how many times each word occurs in the string
    mutate(Number = str_count(string, pattern = fixed(words))) %>%
    # Calculate the score contributed by each word
    mutate(Total_Score = score * Number)
  # Return the sum
  return(sum(temp$Total_Score))
}
# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
  mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
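One thing to be aware of: fixed() matches substrings, so "hi" would also be counted inside a word such as "this". If whole-word matching is what you need, a minimal sketch of an alternative is to split each comment into tokens and look them up in a named vector. This is not part of the answer above. The helper name score_comments is hypothetical, and the sketch assumes lower-case ASCII words and that no word appears in both a and b (only the first score would be used for a duplicate).

```r
# Data from the question (with the misplaced parenthesis in `b` corrected)
a <- data.frame(words = c("hi", "no", "okay", "why"),
                score = c(.98, .5, .2, .1),
                stringsAsFactors = FALSE)
b <- data.frame(words = c("bye", "yes", "here"),
                score = c(.5, .3, .2),
                stringsAsFactors = FALSE)
comment_df <- data.frame(id = c("1", "2", "3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"),
                         stringsAsFactors = FALSE)

# One named numeric vector serves as the word -> score lookup table
score_df <- rbind(a, b)
lookup <- setNames(score_df$score, score_df$words)

# Hypothetical helper: sums the lookup scores of the whole-word tokens
# in each comment (assumes plain ASCII letters)
score_comments <- function(comments, lookup) {
  # Lower-case, then split on runs of non-letters
  tokens <- strsplit(tolower(as.character(comments)), "[^a-z]+")
  # Tokens missing from the lookup give NA and are dropped by na.rm
  vapply(tokens,
         function(tok) sum(lookup[tok], na.rm = TRUE),
         numeric(1))
}

comment_df$add_score <- score_comments(comment_df$comments, lookup)
```

Repeated words are still counted once per occurrence, so "okay okay okay no" scores 3 * .2 + .5. The work per comment is one split and one vector lookup, so the cost does not grow with a per-word pattern search over the score table.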
# Example data (define these before running the code above)
a <- data.frame(words = c("hi","no","okay","why"),
                score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"),
                score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"))