Removing a custom stopword list

Time: 2018-02-01 21:11:25

Tags: r quanteda

I am trying to remove phrases from text using a custom word list.

Here is a reproducible example.

I don't think my attempt is right:

mystop <-  structure(list(stopwords = c("remove", "this line", "remove this line", 
"two lines")), .Names = "stopwords", class = "data.frame", row.names = c(NA, 
-4L))
df <-  structure(list(stopwords = c("Something to remove", "this line must remove two tokens", 
"remove this line must remove three tokens", "two lines to", 
"nothing here to stop")), .Names = "stopwords", class = "data.frame", row.names = c(NA, 
-5L))
> mycorpus <- corpus(df$stopwords)
> mydfm <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE), c(stopwords("SMART"), mystop$stopwords)), , ngrams = c(1,3))
> 
> 
> #convert the dfm to dataframe
> df_ngram <- data.frame(Content = featnames(mydfm), Frequency = colSums(mydfm), 
+                  row.names = NULL, stringsAsFactors = FALSE)
> 
> df_ngram
  Content Frequency
1    line         2
2  tokens         2
3   lines         1
4    stop         1
> df
                                  stopwords
1                       Something to remove
2          this line must remove two tokens
3 remove this line must remove three tokens
4                              two lines to
5                      nothing here to stop
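In the example df, should I expect to find something like "Something to"? I mean, should each document come out cleaned, with the rest of its text left in place?

I want to remove the stopword features from the ngram tokens. So I tried this: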

mydfm2 <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE, ngrams = 1:3), remove = c(stopwords("english"), mystop$stopwords)))
Error in tokens_select(x, ..., selection = "remove") : 
  unused argument (remove = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i'm", "you're", 
"he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", "when's", "where's", "why's", "how's",

Edit, with another reproducible example: here is some dummy text I found in another question:

df <- structure(list(text = c("video game consoles stereos smartphone chargers and other similar devices constantly draw power into their power supplies. Unplug all of your chargers whether it's for a tablet or a toothbrush. Electronics with standby or \\\"\\\"sleep\\\"\\\" modes: Desktop PCs televisions cable boxes DVD-ray players alarm clocks radios and anything with a remote", 
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions the impugned order is in the teeth of the recommendations of the said Committee as communicated in its letter dated 14.05.2017", 
"... focus to the ayurveda sector especially in oral care. A year ago Colgate launched its first India-focused ayurvedic brand Cibaca Vedshakti aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products including toothpaste under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian", 
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali  even though both of these have enough local and multinational competitors in the organised", 
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "text", class = "data.frame", row.names = c(NA, 
-5L))

The stopwords (I created this list using quanteda's ngrams):

mystop <- structure(list(stop = c("dated_modern_dental", "hiring", "local", 
"employees", "modern_dental_college", "multinational", "competitors", 
"state", "dental_college_research", "organised", "human", "rights", 
"college_research_centre", "commission", "founder_increate_advisors", 
"research_centre_supra", "sector_oral_care", "left", "toothless", 
"centre_supra_authorizing")), .Names = "stop", class = "data.frame", row.names = c(NA, 
-20L))

All the steps in code:

library (quanteda)
library(stringr)
#text to lower case
df$text <- tolower(df$text)
#remove all special characters
df$text <- gsub("[[:punct:]]", " ", df$text)
#remove numbers
df$text <- gsub('[0-9]+', '', df$text)
#more in order to remove regular expressions like chinese characters
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
#remove long spaces
df$text <- gsub("\\s+"," ",str_trim(df$text))

This is the step where I make the ngrams and also remove the English stopwords and my own stopword list from the input text.

myDfm <- dfm(tokens_remove(tokens(df$text, remove_punct = TRUE),  c(stopwords("SMART"), mystop$stop)), ngrams = c(1,3))

However, if I convert myDfm to a data frame to check whether the stopword removal worked, I can still see the stopwords in it:

df_ngram <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm), 
                 row.names = NULL, stringsAsFactors = FALSE)

1 answer:

Answer 0 (score: 2)

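I will try to answer what I think you are asking, although your question is hard to follow because the actual problem is buried in a series of largely unnecessary steps that are not directly related to it.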

I think you are confused about how to remove stopwords (in this case, some that you supplied yourself) and then form ngrams.

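Here is how to create the corpus and a character vector of stopwords. There is no need for lists and so on. Note that this applies to quanteda v1.0.0, which now uses the stopwords package for its stopword lists.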

mycorpus <- corpus(df$stopwords)
mystopwords <- c(stopwords(source = "smart"), mystop$stopwords)

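Now we can build the tokens manually, removing the stopwords but leaving a "pad" in their place, which prevents ngrams from being formed from words that were not originally adjacent.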

mytoks <- 
    tokens(mycorpus) %>%
    tokens_remove(mystopwords, padding = TRUE)
mytoks
# tokens from 5 documents.
# text1 :
# [1] "" "" ""
# 
# text2 :
# [1] ""       "line"   ""       ""       ""       "tokens"
# 
# text3 :
# [1] ""       ""       "line"   ""       ""       ""       "tokens"
# 
# text4 :
# [1] ""      "lines" ""     
# 
# text5 :
# [1] ""     ""     ""     "stop"
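
One detail in the printout above is worth spelling out: "line" survives in text2 and text3 even though "this line" is in the custom list. A plain character pattern is compared against individual tokens, so a multi-word entry never matches by itself; here "this" only disappears because it is also a SMART stopword. A minimal sketch of the difference, assuming the multi-word entries are meant to be removed as sequences, wraps the pattern in quanteda's phrase(). The names txt, custom_stop, toks_single and toks_phrase are illustrative and not from the question.

```r
library(quanteda)

# the example documents and custom stopwords from the question
txt <- c("Something to remove",
         "this line must remove two tokens",
         "remove this line must remove three tokens",
         "two lines to",
         "nothing here to stop")
custom_stop <- c("remove", "this line", "remove this line", "two lines")

toks <- tokens(txt)

# plain character pattern: each entry is compared with single tokens,
# so "this line" does not match anything
toks_single <- tokens_remove(toks, custom_stop, padding = TRUE)

# phrase(): entries are split on whitespace and matched as token sequences
toks_phrase <- tokens_remove(toks, phrase(custom_stop), padding = TRUE)

as.list(toks_single)[[2]]
as.list(toks_phrase)[[2]]
```

Comparing the two results for the second document shows whether "this" and "line" were removed together as a phrase.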

At this stage, we can also form the ngrams, using either tokens_ngrams() or the ngrams option in dfm(). Let's use the latter.

dfm(mytoks, ngrams = c(1,3))
# Document-feature matrix of: 5 documents, 4 features (70% sparse).
# 5 x 4 sparse Matrix of class "dfm"
#        features
# docs    line tokens lines stop
#   text1    0      0     0    0
#   text2    1      1     0    0
#   text3    1      1     0    0
#   text4    0      0     1    0
#   text5    0      0     0    1

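No ngrams were created because, as you can see from the token printout above, once the stopwords in the mystopwords vector have been removed, none of the remaining tokens is adjacent to another token.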
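
The second example in the question has stopwords that are themselves ngrams (for example "modern_dental_college"). Those can only match after the ngrams exist, so one option is to form the ngrams first with tokens_ngrams() and then drop the unwanted features with dfm_remove(). The sketch below is a hedged illustration on shortened, made-up text rather than the question's full data; txt, ngram_stop, toks_ng and dfm_ng are hypothetical names. It also avoids the ngrams argument of dfm(), which, as far as I know, is no longer available in more recent quanteda releases.

```r
library(quanteda)

# shortened, illustrative documents (assumption: not the question's full text)
txt <- c("modern dental college research centre",
         "local and multinational competitors")
# stopwords in unigram and ngram form, as in the question's second list
ngram_stop <- c("modern_dental_college", "local", "multinational")

toks <- tokens(txt, remove_punct = TRUE)

# remove ordinary stopwords first, keeping pads so that ngrams
# are not formed across the gaps
toks <- tokens_remove(toks, stopwords("en"), padding = TRUE)

# unigrams and trigrams; tokens are joined with "_" by default
toks_ng <- tokens_ngrams(toks, n = c(1, 3))

# build the dfm, then drop the features listed in ngram_stop
dfm_ng <- dfm_remove(dfm(toks_ng), pattern = ngram_stop)

featnames(dfm_ng)
```

Depending on the quanteda version, the padding may show up as an empty-string feature in the result; in quanteda 3.x, dfm() has a remove_padding argument for dropping it.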