R - Remove stop words from a data frame

Time: 2017-10-24 02:54:39

Tags: r dataframe corpus stop-words

I am working on text analysis and I need to count phrases. My code is:

library(dplyr)
library(tidytext)
txt <- readLines("consolidado.txt",encoding="UTF-8")
txt = iconv(txt, to="ASCII//TRANSLIT")
text_df <- data_frame(line = 1:392, text = txt)
palabras1 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 1)
palabras2 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 2)
palabras3 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 3)
palabras4 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 4)
palabras5 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 5)
palabras6 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 6)
palabras7 <- text_df %>%   unnest_tokens(bigram, text, token = "ngrams", n = 7)

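First I convert the txt into a data frame and then use tidytext. This works fine, but the problem is the stop words. I want to remove the stop words from the data frame, but I don't know how. I tried converting it to a corpus, but that doesn't work: although it removes the stop words, I can no longer count the phrases afterwards.

Is there some way to remove stop words from a data frame?

Thanks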

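2 answers:

Answer 0: (score: 1)

Most text-mining packages in R include standard functionality for removing common stop words. In the tidytext package, the authors include a stop_words dataset containing common stop words. Something like this should do the trick: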

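# Note: tidytext's stop_words is an English lexicon
text_df <- data_frame(line = 1:392, text = txt) %>%
                      unnest_tokens(word, text) %>%       # one token per row, in a column named "word"
                      anti_join(stop_words, by = "word")  # drop rows whose word appears in stop_words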
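
Since the text in the question appears to be Spanish, the same pattern can be sketched with a Spanish list instead. This is only an illustration: the sample sentences and the tiny hand-written spanish_stop table are made up here, and a fuller list would have to come from elsewhere (for example tm::stopwords("spanish")).

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text standing in for consolidado.txt
txt <- c("el perro come en la casa", "la casa es grande")
text_df <- tibble(line = seq_along(txt), text = txt)

# Assumption: a small hand-written stop-word list, for illustration only
spanish_stop <- tibble(word = c("el", "la", "en", "es"))

palabras <- text_df %>%
  unnest_tokens(word, text) %>%          # one word per row
  anti_join(spanish_stop, by = "word")   # keep only rows that are not stop words

palabras
```

The key point is that both tables have a column called word, so the join has something to match on.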

Answer 1: (score: 1)

I tried anti_join, but I got this error:

by required, because the data sources have no common variables

After googling the problem, I tried:

by = NULL
by = c("a" = "b")
by = c(namecolumn = namecolumn)

and many other variations of "by", but I couldn't get it to work.
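
For what it's worth, that error usually means the two tables share no column name. In the question the token column is called bigram, while stop_words has the columns word and lexicon, so the join has nothing to match on by default. A minimal sketch of an explicit mapping, assuming a made-up English sentence (stop_words is an English list) rather than the real file:

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text, not the original data
txt <- c("the cat sat on the mat")
text_df <- tibble(line = seq_along(txt), text = txt)

palabras1 <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 1)

# palabras1 has columns line and bigram; stop_words has word and lexicon.
# There is no shared name, so tell the join which columns correspond:
sin_stop <- palabras1 %>%
  anti_join(stop_words, by = c("bigram" = "word"))

sin_stop
```

The left side of the by mapping is the column in the first table and the right side is the column in stop_words. This only makes sense for single words (n = 1); a multi-word n-gram never equals a single stop word.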

Finally I got it working with this solution:

library(tm)
library(dplyr)
library(tidytext)

txt <- readLines("consolidado.txt",encoding="UTF-8")
txt = iconv(txt, to="ASCII//TRANSLIT")
text_df <- data_frame(line = 1:392, text = txt)

text_df$text = removeWords(text_df$text, stopwords("spanish"))
text_df$text = stripWhitespace(text_df$text)

The tm library includes Spanish stop words.

I select the text column of my data frame, which in my case is called text. Then I use the removeWords function to delete the stop words. The last line collapses the double spaces left behind once the stop words have been removed.
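
To get back to counting phrases, the cleaned column can then go through unnest_tokens as in the question. A rough sketch, assuming two made-up sentences in place of the file. As far as I can tell removeWords matches case-sensitively and the tm list is lowercase, so the text is lowercased first.

```r
library(tm)
library(dplyr)
library(tidytext)

# Hypothetical sample text standing in for consolidado.txt
txt <- c("El perro come en la casa", "el perro come en el parque")
text_df <- tibble(line = seq_along(txt), text = txt)

# Lowercase first so that "El" is treated like "el"
text_df$text <- tolower(text_df$text)
text_df$text <- removeWords(text_df$text, stopwords("spanish"))
text_df$text <- stripWhitespace(text_df$text)

# Build bigrams from the cleaned text and count how often each one occurs
frases2 <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

frases2
```

Note that because the stop words are removed before the n-grams are built, a bigram such as "come casa" joins words that were not adjacent in the original sentence. Whether that is acceptable depends on what the counts are for.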

Thanks for the help.