Find matching strings and update frequencies

Date: 2019-11-20 15:54:50

Tags: r

I have two data frames. I want to check whether the words from the df data frame occur in the texts of the sentence data frame; if so, I need to spread the source values into columns and report the match frequency for each source. Please help me achieve this!

df <- data.frame(words = c("this","when","from","car","good"), source = c("name1", "name1","name2", "name2","name3"))

sentence <- data.frame(Textno = c(1,2,3),texts = c("when this job comes", "the car is good", "from here"))

Expected output

Textno  texts                name1 name2 name3
  1     when this job comes    2     0     0
  2     the car is good        0     1     0
  3     from here              0     0     1

2 answers:

Answer 0 (score: 3)

You can first separate the texts into individual words, then melt the data frame into long format and join it with df. Finally, cast the data.frame back into wide format.

sentence %>% 
  tidyr::separate(texts, into = paste0("word", 1:10), sep = " ", remove = FALSE) %>% 
  reshape2::melt(id.vars = c("Textno", "texts")) %>% 
  left_join(df, by = c("value" = "words")) %>% 
  na.omit() %>% 
  reshape2::dcast(Textno  + texts ~ source)


Textno               texts name1 name2 name3
1      when this job comes     2     0     0
2          the car is good     0     1     1
3                from here     0     1     0

Answer 1 (score: 3)

What you describe is basically looking up words in a dictionary, which is commonly done for sentiment analysis (see). You can accomplish this with a few commands from tidytext, dplyr and tidyr:

library(tidytext)
library(dplyr)
library(tidyr)
sentence %>% 
  unnest_tokens(output = "words", input = "texts", drop = FALSE) %>% # split up words into a tidy format
  left_join(df, by = "words") %>% # join sentences and the dictionary
  filter(!is.na(source)) %>% # remove cases where there was no match
  count(Textno, texts, source) %>%  # count the matches
  pivot_wider(id_cols = c(Textno, texts), names_from = source, 
              values_from = n, values_fill = list(n = 0)) # tidy up your output
#> # A tibble: 3 x 5
#>   Textno texts               name1 name2 name3
#>    <dbl> <chr>               <int> <int> <int>
#> 1      1 when this job comes     2     0     0
#> 2      2 the car is good         0     1     1
#> 3      3 from here               0     1     0

* I set stringsAsFactors = FALSE when creating the two example data.frames.

The same is also possible in quanteda (and is probably faster on larger objects):

library(quanteda)
dict <- df %>% 
  group_by(source) %>%
  summarise(words = list(words)) %>% 
  select(word = words, sentiment = source) %>% # quanteda expects a very particular format when creating a dictionary
  as.dictionary()

corpus(sentence, docid_field = "Textno", text_field = "texts") %>% 
  dfm(dictionary = dict) %>% # this creates a document feature matrix but only with words from the dictionary
  convert("data.frame")
#>   document name1 name2 name3
#> 1        1     2     0     0
#> 2        2     0     1     1
#> 3        3     0     1     0

Or you can try stringr for a more manual approach:

res <- lapply(unique(df$source), function(src) { # loop over every source
  stringr::str_count(sentence$texts, pattern = paste0(df$words[df$source == src], collapse = "|")) # count number of times a word from the source appears
})
names(res) <- unique(df$source) # name the resulting list, which gives you nice column names later

cbind(sentence, res) # binding the list to your data.frame
#>   Textno               texts name1 name2 name3
#> 1      1 when this job comes     2     0     0
#> 2      2     the car is good     0     1     1
#> 3      3           from here     0     1     0
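
All three approaches boil down to the same two steps: split each text into words, then tally matches against each source's word list. For reference, that logic can be sketched in base R without any packages (a minimal sketch, assuming both data.frames were created with stringsAsFactors = FALSE and words are separated by single spaces):

    words_by_source <- split(df$words, df$source)     # list: one word vector per source
    counts <- sapply(words_by_source, function(w) {
      # for each text, count how many of its words appear in this source's list
      sapply(strsplit(sentence$texts, " "), function(s) sum(s %in% w))
    })
    cbind(sentence, counts)                           # bind the count columns onto sentence

Because this matches whole tokens with %in% rather than regular expressions, it avoids the partial-match issue that str_count with pattern = "car" would have on a word like "scarf".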