R: Count the frequency of each unique word in a column

Asked: 2019-04-27 21:39:54

Tags: r

I have a data frame df with a column named strings. The values in this column are sentences.

For example:

id    strings
1     "I want to go to school, how about you?"
2     "I like you."
3     "I like you so much"
4     "I like you very much"
5     "I don't like you"

Now, I have a list of stop words:

["I", "don't", "you"]

How can I build another data frame that stores the total count of each unique word (excluding the stop words) from that column, like this:

keyword      frequency
  want            1
  to              2
  go              1
  school          1
  how             1
  about           1
  like            4
  so              1
  very            1
  much            2

My idea is:

  1. Combine the strings in the column into one big string.
  2. Make a list that stores the unique words in the big string.
  3. Make those unique words a column of a new data frame.
  4. Count the frequencies.

But this seems very inefficient, and I don't know how to write the code.
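For reference, a minimal base-R sketch of those four steps might look like this (assuming df and the stop words defined above; note that splitting on spaces alone leaves punctuation attached, so "school," would not match "school"):

stop_words <- c("I", "don't", "you")

# 1. combine the column into one big string
big_string <- paste(df$strings, collapse = " ")

# 2. split the big string into individual words
all_words <- strsplit(big_string, " ")[[1]]

# 3. drop the stop words
kept_words <- all_words[!all_words %in% stop_words]

# 4. count the frequency of each remaining word
as.data.frame(table(kept_words))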

3 Answers:

Answer 0 (score: 1)

One way is to use tidytext. Here are the book and the code:

library("tidytext")
library("tidyverse")

df <- data.frame(id = 1:6,
                 strings = c("I want to go to school", "how about you?",
                             "I like you.", "I like you so much",
                             "I like you very much", "I don't like you"))

df %>% 
  mutate(strings = as.character(strings)) %>% 
  unnest_tokens(word, strings) %>%   # tokenize the strings into one word per row
  filter(!word %in% c("I", "i", "don't", "you")) %>% 
  count(word)

#> # A tibble: 11 x 2
#>    word       n
#>    <chr>  <int>
#>  1 about      1
#>  2 go         1
#>  3 how        1
#>  4 like       4
#>  5 much       2

Edit

All tokens are converted to lowercase, so you can either include i in the stop words or pass the argument to_lower = FALSE to unnest_tokens.
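For example, a minimal variant of the pipeline above that keeps the original case (a sketch; see ?unnest_tokens for the argument details):

df %>% 
  mutate(strings = as.character(strings)) %>% 
  unnest_tokens(word, strings, to_lower = FALSE) %>%   # keep tokens as typed
  filter(!word %in% c("I", "don't", "you")) %>% 
  count(word)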

Answer 1 (score: 0)

Assuming you have a mystring object and a stopWords vector, you can do it like this:

# split the text into a vector of words
word_vector <- strsplit(mystring, " ")[[1]]

# remove the stop words from the vector
word_vector <- word_vector[!word_vector %in% stopWords]

At this point, you can turn the table() of frequencies into a data.frame object:

frequency_df <- data.frame(table(word_vector))
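The answer assumes mystring already exists; if it should cover the whole column, one possibility (my assumption, not stated in the answer) is to collapse the column first:

# collapse the whole strings column into a single string
mystring <- paste(df$strings, collapse = " ")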

Let me know if this helps.

Answer 2 (score: 0)

First, you can create a vector of all the words via str_split, and then build a frequency table of those words.

library(stringr)
stop_words <- c("I", "don't", "you")

# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))

# create a frequency table 
word_list <- as.data.frame(table(all_words))

# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
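One caveat: splitting on spaces keeps punctuation attached, so tokens like "you." and "you?" stay distinct and slip past the stop list. A possible refinement (a sketch, stripping sentence punctuation while keeping apostrophes so "don't" still matches):

# strip sentence punctuation but keep apostrophes, so "don't" still matches the stop list
clean_strings <- str_remove_all(as.character(df$strings), "[.,!?]")

all_words <- unlist(str_split(clean_strings, pattern = " "))
word_list <- as.data.frame(table(all_words))
word_list[!word_list$all_words %in% stop_words, ]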