Hi, I am using the tidy text format and I am trying to replace the strings "emails" and "emailing" with "email".
library(dplyr)
library(tidytext)
library(ggplot2)
set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)
tidy_df <- df %>%
  unnest_tokens(word, sentence)
tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
This works fine, but when I use:
tidy_df <- gsub("emailing", "email", tidy_df)
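to replace the word and then run the bar chart again, I get the following error message:
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
Does anyone know how to easily replace words in the tidy text format without changing the structure/class of tidy_text?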
Answer 0 (score: 9)
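Removing word endings like this is called stemming, and there are a couple of packages in R that will do it for you if you'd like. One is the hunspell package from rOpenSci, and another option is the SnowballC package, which implements Porter algorithm stemming. You would implement it like so: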
library(dplyr)
library(tidytext)
library(SnowballC)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
set.seed(123)
# tibble() replaces the now-deprecated data_frame()
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 i
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 i
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
Notice that this stems all of your text, and that some of the results no longer look like real words; you may or may not care about that.
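To see which tokens a stemmer would change before applying it to a whole corpus, a small side-by-side comparison can help. This is only a minimal sketch, assuming SnowballC is installed; the `vocab` vector and the `original`, `stem`, and `changed` column names are made up for illustration.

```r
library(SnowballC)

# Hypothetical vocabulary: the distinct words from the example sentences
vocab <- c("emails", "are", "nice", "emailing", "is", "fun",
           "computer", "freaks", "broken", "modem")

# Put each word next to its Porter stem and flag the ones that changed
stems <- data.frame(original = vocab,
                    stem = wordStem(vocab),
                    stringsAsFactors = FALSE)
stems$changed <- stems$original != stems$stem

# Show only the words the stemmer altered
stems[stems$changed, ]
```

The "is" that became "i" in the output above is one of the rows this would list.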
If you don't want to stem all of your text with a stemmer like SnowballC or hunspell, you can use dplyr's if_else() within mutate() to replace just specific words.
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
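If the list of words to replace grows, one possible variation on the if_else() idea is a lookup table kept as a named character vector. The sketch below uses base R only; the `lookup`, `words`, and `hit` names are hypothetical, and `words` merely stands in for the `word` column produced by unnest_tokens().

```r
# Hypothetical lookup table: names are the words to replace, values the replacements
lookup <- c(emails = "email", emailing = "email")

# Stand-in for the `word` column produced by unnest_tokens()
words <- c("emails", "are", "nice", "emailing", "is", "fun")

# Replace only the words that appear in the lookup; leave the rest untouched
hit <- words %in% names(lookup)
words[hit] <- unname(lookup[words[hit]])

words
```

Inside a pipeline, the same idea could probably be written as mutate(word = if_else(word %in% names(lookup), unname(lookup[word]), word)).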
Or it might make more sense for you to use str_replace() from the stringr package.
library(stringr)
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
  unnest_tokens(word, txt) %>%
  mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#> word
#> <chr>
#> 1 email
#> 2 is
#> 3 fun
#> 4 broken
#> 5 modem
#> 6 email
#> 7 is
#> 8 fun
#> 9 broken
#> 10 modem
#> # ... with 243 more rows
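As for the error in the question: gsub() works on character vectors, so passing it the whole data frame makes R coerce the data frame to one, and what comes back is no longer a data frame. That is most likely why the later count() call fails. Applying the replacement to the word column keeps the class intact. Below is a base R sketch, with a small made-up `tidy_df` standing in for the real one; anchoring the pattern with ^ and $ is an added assumption so that only whole tokens are rewritten, and "emailserver" is an invented token to show the difference.

```r
# Small stand-in for the one-token-per-row data frame from unnest_tokens()
tidy_df <- data.frame(word = c("emails", "is", "emailing", "emailserver"),
                      stringsAsFactors = FALSE)

# What the question did: gsub() on the whole data frame returns a character vector
broken <- gsub("emailing", "email", tidy_df)
class(broken)

# Replace within the column instead; the data frame keeps its class
fixed <- tidy_df
fixed$word <- gsub("^email(s|ing)$", "email", fixed$word)
class(fixed)
fixed$word
```

The same anchored pattern can be passed to str_replace() in the pipeline above if partial matches inside longer tokens are a concern.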