R quanteda: top features extraction returns modified words

Date: 2018-08-21 03:59:14

Tags: r quanteda

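I tried to extract the top features with quanteda, but the results are modified words, e.g. "faulti" instead of "faulty". Is this the expected result?

I searched the original dataset for the top feature keyword ("faulti"), but it has no matches there.

Edit: if I set the option stem = FALSE in dfm(), the keywords come back as normal words.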

library(quanteda)    
corpus1 = corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')

#[text25701, 4]              Convertible roof sometime | faulty | . SD card missing.               
#[text25701, 22]              unavailable). Pilot lamp | faulty | .  

dfm1 <- dfm(
  corpus1, 
  ngrams = 1, 
  remove = stopwords("english"),
  remove_punct = TRUE,
  remove_numbers = TRUE,
  stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#faulti malfunct    light    damag     miss    cover     rear     loos     lamp    plate 
#   562      523      454      337      331      325      295      259      250      238 

library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495

1 Answer:

Answer 0: (score: 1)

dfm() does not stem by default. However, you set the stem option to TRUE, hence "faulti". As you mention in your edit, setting this to FALSE (or omitting it) returns the unstemmed words.
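To see the stemming in isolation, here is a minimal sketch that applies quanteda's stemmer to a few words by hand. The word list is hypothetical, chosen to resemble the stems in the question's output, and it assumes a quanteda version that exports char_wordstem().

```r
library(quanteda)

# Hypothetical sample words, chosen to resemble the stems in the question
words <- c("faulty", "malfunction", "damaged", "missing", "loose")

# Apply the English Snowball (Porter) stemmer to each word
stems <- char_wordstem(words, language = "english")

# Show each original word next to its stem; "faulty" becomes "faulti"
print(setNames(stems, words))
```

The stems are truncated forms produced by the algorithm, not dictionary words, which is why "faulti" never appears in the raw text.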

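But you seem to misunderstand what a dfm returns and what str_detect returns. str_detect only detects whether the search string is present in a text, not how many times it occurs. Your sum therefore only counts the texts that contain the word (495). The dfm counts the number of times the word actually appears in the texts (562).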

Have a look at the following examples to see the difference:

# 1 line of text (paragraph)
my_text <- "I have two examples of two words in this text. Isn't having two words fun?"
topfeatures(dfm(my_text, remove = stopwords("english"), remove_punct = TRUE), n = 2)
#  two words 
#    3     2 
sum(str_detect(my_text, "two"))
# [1] 1

# 2 sentences.
my_text2 <- c("I have two examples of two words in this text.", "Isn't having two words fun?")
topfeatures(dfm(my_text2, remove = stopwords("english"), remove_punct = TRUE), n = 2)
#  two words 
#    3     2 
sum(str_detect(my_text2, "two"))
# [1] 2

For the first example, topfeatures returns 3 for the word "two", while str_detect only returns 1. str_detect has just one vector element (one piece of text) to check.

For the second example, topfeatures again returns 3 for the word "two". str_detect now returns 2: there are 2 values in the vector, so it detects the word "two" in both sentences, but that is still short of the actual 3.
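If the goal is to reproduce the dfm count with stringr, one option (a sketch, not part of the original answer) is str_count, which counts matches per text instead of only testing for presence. The example below redefines the two-sentence vector so that it runs on its own.

```r
library(stringr)

# Same two sentences as in the second example
my_text2 <- c("I have two examples of two words in this text.",
              "Isn't having two words fun?")

# Presence per text: one TRUE/FALSE per element, so the sum is at most 2
n_texts <- sum(str_detect(my_text2, "two"))

# Occurrences per text: word boundaries (\\b) avoid matching inside longer words
n_occurrences <- sum(str_count(my_text2, "\\btwo\\b"))

print(c(texts_containing = n_texts, total_occurrences = n_occurrences))
```

On the question's data the two approaches can still differ slightly. dfm() lowercases tokens by default while str_count is case-sensitive, and with stemming enabled several surface forms can collapse into one feature.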