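I have two data frames. I am trying to find out whether the words in the data frame df occur in the texts of the data frame sentence. If they do, I need to spread the source values into columns and print the frequency for each source value. Please help me achieve this!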
df <- data.frame(words = c("this","when","from","car","good"), source = c("name1", "name1","name2", "name2","name3"))
sentence <- data.frame(Textno = c(1,2,3),texts = c("when this job comes", "the car is good", "from here"))
Expected output
Textno texts name1 name2 name3
1 when this job comes 2 0 0
2 the car is good 0 1 0
3 from here 0 0 1
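Answer 0 (score: 3)
You can first separate the texts into words.
Then melt the data frame into long format and join it with df.
Finally, cast the data.frame back into wide format.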
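library(dplyr) # provides %>% and left_join()
sentence %>%
  tidyr::separate(texts, into = paste0("word", 1:10), sep = " ", remove = FALSE, fill = "right") %>% # one column per word; assumes at most 10 words per text
  reshape2::melt(id.vars = c("Textno", "texts")) %>% # long format: one row per word slot
  left_join(df, by = c("value" = "words")) %>% # look up the source of each word
  na.omit() %>% # drop empty slots and words without a source
  reshape2::dcast(Textno + texts ~ source, fun.aggregate = length, value.var = "value") # count words per source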
Textno texts name1 name2 name3
1 when this job comes 2 0 0
2 the car is good 0 1 1
3 from here 0 1 0
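The three steps above (split into words, join with the dictionary, reshape to wide) can also be sketched in base R without extra packages. This is an illustrative alternative, not the answer's own code; the intermediate names `tokens`, `long`, `matched` and `counts` are made up for this sketch, and it assumes words are separated by single spaces.

```r
# Example data from the question
df <- data.frame(words = c("this", "when", "from", "car", "good"),
                 source = c("name1", "name1", "name2", "name2", "name3"),
                 stringsAsFactors = FALSE)
sentence <- data.frame(Textno = c(1, 2, 3),
                       texts = c("when this job comes", "the car is good", "from here"),
                       stringsAsFactors = FALSE)

# Step 1: split each text into words and stack them in long format
tokens <- strsplit(sentence$texts, " ", fixed = TRUE)
long <- data.frame(Textno = rep(sentence$Textno, lengths(tokens)),
                   words = unlist(tokens),
                   stringsAsFactors = FALSE)

# Step 2: inner join with the dictionary; words without a source are dropped
matched <- merge(long, df, by = "words")

# Step 3: cross-tabulate text number against source to get the wide counts
# (explicit factor levels keep texts and sources that have zero matches)
counts <- table(factor(matched$Textno, levels = sentence$Textno),
                factor(matched$source, levels = unique(df$source)))
result <- cbind(sentence, as.data.frame.matrix(counts))
result
```

Setting the factor levels explicitly is a design choice here: a text with no dictionary words would otherwise disappear from the table instead of showing a row of zeros.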
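Answer 1 (score: 3)
What you describe is essentially looking up words from a dictionary, which is commonly done in sentiment analysis. You can do this with a few commands from tidytext, dplyr and tidyr: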
library(tidytext)
library(dplyr)
library(tidyr)
sentence %>%
unnest_tokens(output = "words", input = "texts", drop = FALSE) %>% # split up words into a tidy format
left_join(df, by = "words") %>% # join sentences and the dictionary
filter(!is.na(source)) %>% # remove cases where there was no match
count(Textno, texts, source) %>% # count the matches
pivot_wider(id_cols = c(Textno, texts), names_from = source,
values_from = n, values_fill = list(n = 0)) # tidy up your output
#> # A tibble: 3 x 5
#> Textno texts name1 name2 name3
#> <dbl> <chr> <int> <int> <int>
#> 1 1 when this job comes 2 0 0
#> 2 2 the car is good 0 1 1
#> 3 3 from here 0 1 0
*I set stringsAsFactors = FALSE when creating the two example data.frames.
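For reference, the example data with that setting might be created as below. This is a sketch based on the question's data, not the answerer's exact code. In R versions before 4.0.0, `data.frame()` converted strings to factors by default, which can lead to coercion warnings when joining on the word column.

```r
# Example data from the question, with text kept as character vectors
df <- data.frame(words = c("this", "when", "from", "car", "good"),
                 source = c("name1", "name1", "name2", "name2", "name3"),
                 stringsAsFactors = FALSE)
sentence <- data.frame(Textno = c(1, 2, 3),
                       texts = c("when this job comes", "the car is good", "from here"),
                       stringsAsFactors = FALSE)

# Check that the text columns are character, not factor
is.character(df$words)
is.character(sentence$texts)
```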
This is also possible in quanteda (and probably faster on larger objects):
library(quanteda)
dict <- df %>%
group_by(source) %>%
summarise(words = list(words)) %>%
select(word = words, sentiment = source) %>% # quanteda expects a very particular format when creating a dictionary
as.dictionary()
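# note: newer quanteda releases may require tokens() %>% tokens_lookup(dictionary = dict) before dfm()
corpus(sentence, docid_field = "Textno", text_field = "texts") %>%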
dfm(dictionary = dict) %>% # this creates a document feature matrix but only with words from the dictionary
convert("data.frame")
#> document name1 name2 name3
#> 1 1 2 0 0
#> 2 2 0 1 1
#> 3 3 0 1 0
Or you can try stringr for a more manual approach:
res <- lapply(unique(df$source), function(src) { # loop over every source
stringr::str_count(sentence$texts, pattern = paste0(df$words[df$source == src], collapse = "|")) # count number of times a word from the source appears
})
names(res) <- unique(df$source) # name the resulting list, which gives you nice column names later
cbind(sentence, res) # binding the list to your data.frame
#> Textno texts name1 name2 name3
#> 1 1 when this job comes 2 0 0
#> 2 2 the car is good 0 1 1
#> 3 3 from here 0 1 0
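One caveat with the approach above: a pattern such as `car|good` is matched anywhere in the string, so `car` would also be counted inside `carpet`. A minimal base R sketch that counts whole words only is shown below. The helper `count_whole_words` is hypothetical, and it assumes the dictionary words contain no regex metacharacters.

```r
# Example data from the question
df <- data.frame(words = c("this", "when", "from", "car", "good"),
                 source = c("name1", "name1", "name2", "name2", "name3"),
                 stringsAsFactors = FALSE)
sentence <- data.frame(Textno = c(1, 2, 3),
                       texts = c("when this job comes", "the car is good", "from here"),
                       stringsAsFactors = FALSE)

# Hypothetical helper: count whole-word matches of any of `words` in each text
count_whole_words <- function(texts, words) {
  # \\b anchors the alternation at word boundaries
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- gregexpr(pattern, texts, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions only
  vapply(hits, function(h) sum(h > 0), integer(1))
}

sources <- unique(df$source)
res <- lapply(sources, function(src) {
  count_whole_words(sentence$texts, df$words[df$source == src])
})
names(res) <- sources
out <- cbind(sentence, as.data.frame(res))
out
```

With this helper, `count_whole_words("the carpet is good", "car")` gives 0, whereas an unanchored `car` pattern would match once.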