Question

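I am trying to extract the top features with Quanteda, but the results are modified words, e.g. "faulti" instead of "faulty". Is this the expected result?

I searched the original dataset for the top feature keyword, but the counts do not match what I expected.

Edit: if I set the option stem = FALSE in dfm(), the keywords go back to normal words.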

library(quanteda)    
corpus1 = corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')

#[text25701, 4]              Convertible roof sometime | faulty | . SD card missing.               
#[text25701, 22]              unavailable). Pilot lamp | faulty | .  

dfm1 <- dfm(
  corpus1, 
  ngrams = 1, 
  remove = stopwords("english"),
  remove_punct = TRUE,
  remove_numbers = TRUE,
  stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#faulti malfunct    light    damag     miss    cover     rear     loos     lamp    plate 
#   562      523      454      337      331      325      295      259      250      238 

library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495

Answer 1

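The truncated forms come from stemming: with stem = TRUE, "faulty" is reduced to the stem "faulti", so that part is expected. However, you seem to misunderstand what dfm returns and what str_detect returns. str_detect only detects whether the search string is present in each text, not how many times it occurs, so your sum counts the texts that contain the word (495). topfeatures counts the number of times the word actually occurs across all texts (562).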
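On the truncated forms themselves, the effect of stemming can be reproduced on a few words in isolation. This is a minimal sketch, assuming a quanteda version that provides char_wordstem(); the sample words are hypothetical and were picked only because they resemble the top features shown in the question.

```r
library(quanteda)

# Hypothetical sample words (not taken from the original dataset)
words <- c("faulty", "damaged", "missing", "loose")

# char_wordstem() applies quanteda's stemmer to a character vector,
# the same kind of reduction that stem = TRUE applies to the features
stems <- char_wordstem(words)
print(stems)
```

Searching the raw text for a stem such as "faulti" finds nothing because the stem only exists in the dfm, never in the original remarks.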

The following examples show the difference between topfeatures and str_detect:

library(quanteda)
library(stringr)

# 1 line of text (paragraph)
my_text <- "I have two examples of two words in this text. Isn't having two words fun?"
topfeatures(dfm(my_text, remove = stopwords("english"), remove_punct = TRUE), n = 2)
#  two words 
#    3     2 
sum(str_detect(my_text, "two"))
# [1] 1

# 2 sentences
my_text2 <- c("I have two examples of two words in this text.", "Isn't having two words fun?")
topfeatures(dfm(my_text2, remove = stopwords("english"), remove_punct = TRUE), n = 2)
#  two words 
#    3     2 
sum(str_detect(my_text2, "two"))
# [1] 2

In the first example, topfeatures returns 3 for the word "two", while str_detect returns only 1, because str_detect is given a vector with a single piece of text.

In the second example, topfeatures again returns 3 for the word "two". str_detect now returns 2: the vector has 2 elements and the word "two" is detected in both sentences, but that still falls short of the actual 3 occurrences.
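To get a count from stringr that is comparable to a term frequency, one option is str_count, which counts matches per text instead of testing for presence. This is a minimal sketch on the two example sentences above; the word-boundary pattern "\\btwo\\b" is an added assumption, used so that longer words containing "two" are not matched.

```r
library(stringr)

texts <- c("I have two examples of two words in this text.",
           "Isn't having two words fun?")

# Number of texts that contain the word at least once
n_texts <- sum(str_detect(texts, "\\btwo\\b"))

# Total number of occurrences across all texts
n_occurrences <- sum(str_count(texts, "\\btwo\\b"))

print(c(texts_with_word = n_texts, occurrences = n_occurrences))
```

On real data the totals may still differ from the dfm counts, since a stemmed feature can merge several word forms and a plain pattern search does not tokenize the text the way dfm does.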
