Word substitution within tidy text format

Time: 2017-04-11 11:02:57

Tags: r text-mining tidytext

Hi, I'm working with the tidy text format and I am trying to replace the strings "emails" and "emailing" with "email".

library(dplyr)
library(tidytext)
library(ggplot2)

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)
# one row per word
tidy_df <- df %>% 
  unnest_tokens(word, sentence)

tidy_df %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 20) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) + 
  coord_flip()

This works fine, but when I use:

 tidy_df <- gsub("emailing", "email", tidy_df)

to replace the word and then run the bar chart again, I get the following error message:

Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "character"

Does anyone know how to easily replace words within the tidy text format without changing the structure/class of tidy_text?

1 Answer:

Answer 0 (score: 9)

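Removing the endings of words like this is called stemming, and there are a couple of packages in R that will do this for you, if you'd like. One is the hunspell package from rOpenSci, and another option is the SnowballC package, which implements Porter algorithm stemming. You would implement that like so: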

library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2       i
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7       i
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows

Notice that it is stemming all of your text and that some of the words no longer look like real words; you may or may not care about that.
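
To see what stemming does before applying it to a whole data set, it may help to run the stemmer on a handful of words first. The following is only a small sketch: the `words` vector and the `changed` column are made up for illustration, and it assumes SnowballC is installed.

```r
library(SnowballC)

# Hypothetical sample of words taken from the example sentences
words <- c("emails", "emailing", "is", "computer", "modem")

# wordStem() uses the Porter stemmer by default
stems <- wordStem(words)

# Side-by-side comparison, flagging the words the stemmer altered
comparison <- data.frame(word = words,
                         stem = stems,
                         changed = words != stems,
                         stringsAsFactors = FALSE)
comparison
```

Rows where `changed` is TRUE are the ones worth checking by eye, since that is where stems such as "i" for "is" show up.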

If you don't want to stem all of your text using a stemmer like SnowballC or hunspell, you can use dplyr's if_else within mutate() to replace just the specific words.

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
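
If the list of replacements grows, a named lookup vector may be easier to maintain than a long `%in%` test. This is only a sketch of one possible variation, not part of the original approach; `lookup`, `words` and `result` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical lookup table: names are the words to replace, values the replacements
lookup <- c(emails = "email", emailing = "email")

# Small stand-in for a tokenized data frame with one word per row
words <- tibble(word = c("emails", "is", "emailing", "modem"))

# lookup[word] is NA for words without an entry, so coalesce() falls back
# to the original word in those rows
result <- words %>%
  mutate(word = coalesce(unname(lookup[word]), word))
result
```

Words that have no entry in `lookup` pass through unchanged, so the result is still a one-word-per-row data frame.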

Or it might make more sense for you to use str_replace from the stringr package.

library(stringr)
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%
        unnest_tokens(word, txt) %>%
        mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
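
As for the error in the question: gsub() was applied to the whole data frame, and gsub() coerces its input to a character vector, so `tidy_df` stopped being a data frame and count() had nothing it could group. A minimal base R sketch of the difference follows; the small `tidy_df` here is a hypothetical stand-in for the tokenized data, and the `^` and `$` anchors are an added assumption so that only whole words are rewritten.

```r
# Hypothetical stand-in for the tokenized data frame from the question
tidy_df <- data.frame(word = c("emails", "are", "nice", "emailing", "is", "fun"),
                      stringsAsFactors = FALSE)

# gsub() on the whole data frame coerces it to a character vector
broken <- gsub("emailing", "email", tidy_df)
class(broken)

# Applying gsub() to the word column only keeps the data frame intact
tidy_df$word <- gsub("^email(s|ing)$", "email", tidy_df$word)
class(tidy_df)
tidy_df$word
```

Without the anchors, the pattern would also match inside longer words that merely contain "emails" or "emailing".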