Creating a score based on word occurrences

Time: 2017-06-18 19:40:56

Tags: r regex

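I have two data frames, each containing a column of words and a column of scores associated with those words. I want to run comments against these data frames and create an additive score based on which of the words appear in each comment.

I need to do this for a very large number of comments, so it has to be computationally efficient. For example, the sentence "hi, he said. why is it okay" scores .98 + .1 + .2, because the words "hi", "why" and "okay" are in the data frames. Any sentence may contain words from more than one data frame.

Can anyone help me create the column "add_score" using a process that scales well to large data frames? Thanks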

a <- data.frame(words = c("hi","no","okay","why"), score = c(.98,.5,.2,.1))
b <- data.frame(words = c("bye","yes","here"), score = c(.5,.3,.2))
comment_df <- data.frame(id = c("1","2","3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"))
# Desired result: one summed score per comment
comment_df$add_score <- c(1.28,1.1,.5)

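1 answer:

Answer 0: (score: 2)

This solution uses functions from tidyverse and stringr. It merges the two score tables, counts how often each scored word occurs in a comment, and sums score times count.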

# Load packages
library(tidyverse)
library(stringr)

# Merge a and b to create score_df
score_df <- bind_rows(a, b)

# Create a function to calculate score for one string
string_cal <- function(string, score_df){

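  temp <- score_df %>%
    # Count how many times each word occurs in the string
    # (fixed() matches the word as a literal substring)
    mutate(Number = str_count(string, pattern = fixed(words))) %>%
    # Calculate each word's contribution: score times number of occurrences
    mutate(Total_Score = score * Number)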

  # Return the sum
  return(sum(temp$Total_Score))
}

# Use map_dbl to apply the string_cal function over comments
# The results are stored in the add_score column
comment_df <- comment_df %>%
  mutate(add_score = map_dbl(comments, string_cal, score_df = score_df))
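
One thing to be aware of: because `fixed()` matches literal substrings, a short word such as "no" is also counted inside "know", and "hi" inside "this". If only whole words should score, a minimal sketch is to wrap each word in `\b` word-boundary anchors. The helper name `string_cal_whole` is made up for this illustration, and the sketch assumes the words contain no regex metacharacters.

```r
library(stringr)

# Score table copied from the question (a and b combined)
score_df <- data.frame(
  words = c("hi", "no", "okay", "why", "bye", "yes", "here"),
  score = c(.98, .5, .2, .1, .5, .3, .2),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of the original answer):
# counts whole-word matches only, ignoring case
string_cal_whole <- function(string, score_df) {
  # Assumption: words are plain letters, so no regex escaping is needed
  patterns <- paste0("\\b", score_df$words, "\\b")
  counts <- str_count(string, regex(patterns, ignore_case = TRUE))
  sum(score_df$score * counts)
}

# "know" and "this" no longer contribute; only "okay" is scored
string_cal_whole("I know this is okay", score_df)
```

It can be dropped into the `map_dbl()` call above in place of `string_cal`.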

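Since the question asks for something that scales, here is a rough sketch of an alternative: split each comment into tokens once and look the tokens up in a named vector, instead of scanning every comment once per scored word. It is not the approach used in the answer above, and whether it is actually faster on real data would need benchmarking. It assumes each word appears only once across the score tables and that words consist of the letters a to z only.

```r
library(stringr)

# Sample data copied from the question
score_df <- data.frame(
  words = c("hi", "no", "okay", "why", "bye", "yes", "here"),
  score = c(.98, .5, .2, .1, .5, .3, .2),
  stringsAsFactors = FALSE
)
comment_df <- data.frame(
  id = c("1", "2", "3"),
  comments = c("hi, he said. why is it okay",
               "okay okay okay no",
               "yes, here is it"),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: names are words, values are scores
# (assumption: no word is listed twice)
score_lookup <- setNames(score_df$score, score_df$words)

# Split each lower-cased comment into alphabetic tokens (one vector per comment)
tokens <- str_extract_all(str_to_lower(comment_df$comments), "[a-z]+")

# Sum the scores of the tokens found in the lookup;
# tokens without a score come back as NA and are dropped by na.rm = TRUE
comment_df$add_score <- vapply(
  tokens,
  function(tok) sum(score_lookup[tok], na.rm = TRUE),
  numeric(1)
)

comment_df
```

Because whole tokens are looked up, this variant also scores whole words only, and repeated words are counted once per occurrence.
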
Data preparation

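# stringsAsFactors = FALSE keeps words and comments as character vectors
# (this is already the default from R 4.0.0 onwards)
a <- data.frame(words = c("hi","no","okay","why"),
                score = c(.98,.5,.2,.1),
                stringsAsFactors = FALSE)
b <- data.frame(words = c("bye","yes","here"),
                score = c(.5,.3,.2),
                stringsAsFactors = FALSE)
comment_df <- data.frame(id = c("1","2","3"),
                         comments = c("hi, he said. why is it okay",
                                      "okay okay okay no",
                                      "yes, here is it"),
                         stringsAsFactors = FALSE)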