Question

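I have survey data with a comment column, and I want to run sentiment analysis on the responses. The problem is that the data contains many languages, and I can't figure out how to remove stop words for several languages from the set.

'nps' is my data source, and nps$customer.feedback is the comment column.

First, I tokenise the data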

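# TOKENISE
library(dplyr)
library(tidytext)

comments <- nps %>%
  filter(!is.na(customer.feedback)) %>%          # drop rows with no comment
  select(cat, Comment = customer.feedback) %>%   # assumes Comment is the feedback column
  mutate(id = row_number())

# One row per word (assumed step: nps_words is used below but never created in the post)
nps_words <- comments %>%
  unnest_tokens(word, Comment)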

Remove the stop words

nps_words <-  nps_words %>% anti_join(stop_words, by = c('word'))

Then I use stemming and get_sentiments("bing") to show word counts by sentiment.

# stemgraph
library(SnowballC)  # wordStem()
library(ggplot2)

nps_words %>%
  mutate(word = wordStem(word)) %>%
  inner_join(get_sentiments("bing") %>% mutate(word = wordStem(word)),
             by = "word") %>%
  count(cat, word, sentiment) %>%
  group_by(cat, sentiment) %>%
  top_n(7, n) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~cat, scales = "free") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Word counts by Sentiment by Category - Bing (Stemmed)",
       x = "Words", y = "Count")

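However, because German text is being analysed as well, "di" and "die" show up in the "negative" panel.

Can anyone help?

My goal is to remove German stop words using the code above.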

Answer 1

To answer your question, you can remove German stop words like this, using the stopwords package:

# your code
# .....
stop_german <- data.frame(word = stopwords::stopwords("de"), stringsAsFactors = FALSE)

nps_words <- nps_words %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(stop_german, by = "word")

# ...
# rest of code
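Since the question mentions more than one language, the same idea can be stretched to several stop word lists at once. Below is a rough sketch, not a definitive recipe: the `langs` vector and the toy `nps_words` table are made up for illustration, and it assumes the default (Snowball) source of the stopwords package covers the languages in your data.

```r
library(dplyr)
library(stopwords)

# Hypothetical set of languages present in the survey (ISO 639-1 codes)
langs <- c("en", "de", "fr")

# A single `word` column holding the stop words of all listed languages
stop_multi <- data.frame(
  word = unique(unlist(lapply(langs, stopwords::stopwords))),
  stringsAsFactors = FALSE
)

# Toy stand-in for the tokenised data from the question
nps_words <- data.frame(
  cat  = c("a", "a", "b", "b", "b"),
  word = c("der", "service", "the", "nous", "freundlich"),
  stringsAsFactors = FALSE
)

# Drop every token that appears in any of the stop word lists
nps_clean <- nps_words %>% anti_join(stop_multi, by = "word")
```

Note that pooling the lists like this removes a word in every comment, whatever language that comment is written in, so it is the bluntest option.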

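BUT, be aware that tidytext is built mainly for English, not for other languages. Stemming and sentiment analysis on German text will give you incorrect results. The Bing sentiment lexicon only covers English words. An inner_join like yours drops most German words, because nothing in the English lexicon matches them. Some words do match, though, such as "die" (in German an article or relative pronoun meaning "the", "who", or "that", which is removed if you apply the German stop words). But if you remove that word, you may also accidentally remove the English "die" (as in death).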
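One possible way around the "die" clash is to remove stop words per language instead of globally. The sketch below assumes each token carries a language label in a column called `lang`; that column is hypothetical and not part of the original data, so it would have to come from survey metadata or a language detection step of your choosing.

```r
library(dplyr)
library(stopwords)

# Stop words tagged with the language they belong to
stop_by_lang <- bind_rows(
  data.frame(lang = "en", word = stopwords::stopwords("en"), stringsAsFactors = FALSE),
  data.frame(lang = "de", word = stopwords::stopwords("de"), stringsAsFactors = FALSE)
)

# Toy tokens; `lang` is a hypothetical per-comment language label
tokens <- data.frame(
  lang = c("de", "de", "en", "en"),
  word = c("die", "rechnung", "die", "battery"),
  stringsAsFactors = FALSE
)

# Matching on both columns removes German "die" but keeps English "die"
tokens_clean <- tokens %>% anti_join(stop_by_lang, by = c("lang", "word"))
```

This only fixes the stop word side. The German tokens that survive will still mostly fail to match the English Bing lexicon, so a German sentiment lexicon is still needed for those comments.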

This SO post has more information on German sentiment analysis.
