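I tried to extract the top features with quanteda, but the results are modified words, i.e. "faulti" instead of "faulty". Is this the expected result?
I searched the original dataset for the top feature keyword, but found no match, contrary to what I expected.
Edit: if I set the option stem = FALSE for the function dfm(), the keywords revert to normal words.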
library(quanteda)
corpus1 = corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')
#[text25701, 4] Convertible roof sometime | faulty | . SD card missing.
#[text25701, 22] unavailable). Pilot lamp | faulty | .
dfm1 <- dfm(
corpus1,
ngrams = 1,
remove = stopwords("english"),
remove_punct = TRUE,
remove_numbers = TRUE,
stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#  faulti malfunct    light    damag     miss    cover     rear     loos     lamp    plate
#     562      523      454      337      331      325      295      259      250      238
library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495
Answer 0 (score: 1)
dfm() does not stem by default. However, you set the stem option to TRUE, hence "faulti". As you mentioned in your edit, setting it to FALSE (or omitting it) returns the unstemmed words.
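To see the effect of stemming in isolation, here is a minimal sketch. It assumes a recent quanteda release (v3 or later), where the tokens() / dfm_wordstem() pipeline replaces the stem argument of dfm(); the two remarks are made-up stand-ins for training_data$Elec_rmk, not rows from the real data.

```r
library(quanteda)

# Hypothetical toy remarks standing in for training_data$Elec_rmk
remarks <- c("Pilot lamp faulty.", "Faulty rear light, cover missing.")

# Tokenize, dropping punctuation and English stopwords
toks <- tokens(remarks, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))

# dfm() lowercases by default but does not stem
dfm_plain <- dfm(toks)

# Stemming is a separate, explicit step
dfm_stem <- dfm_wordstem(dfm_plain)

featnames(dfm_plain)  # contains "faulty"
featnames(dfm_stem)   # contains "faulti"

# The stemmer can also be applied directly to a character vector
char_wordstem("faulty")
```

The unstemmed dfm keeps "faulty", while the stemmed one carries "faulti", which is the form that showed up in the question's topfeatures() output.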
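But you seem to misunderstand what topfeatures() returns for a dfm and what str_detect() returns. str_detect() only detects whether the search string is present in each text, not how many times it occurs, so your sum only counts the number of texts that contain the word (495). topfeatures() counts how many times the word actually occurs across the texts (562).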
See the following examples to understand the difference:
# 1 line of text (paragraph)
my_text <- "I have two examples of two words in this text. Isn't having two words fun?"
topfeatures(dfm(my_text, remove = stopwords("english"), remove_punct = TRUE), n = 2)
  two words
    3     2
sum(str_detect(my_text, "two"))
[1] 1
# 2 sentences.
my_text2 <- c("I have two examples of two words in this text.", "Isn't having two words fun?")
topfeatures(dfm(my_text2, remove = stopwords("english"), remove_punct = TRUE), n = 2)
  two words
    3     2
sum(str_detect(my_text2, "two"))
[1] 2
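For the first example, topfeatures() returns 3 for the word "two", while str_detect() returns only 1, because only a single vector element (one piece of text) goes into str_detect().
For the second example, topfeatures() again returns 3 for the word "two". str_detect() now returns 2: the vector has 2 elements and the word "two" is detected in both sentences, but that still falls short of the actual 3 occurrences.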
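If the aim is to get an occurrence count from stringr that is comparable to topfeatures(), str_count() counts every match per element instead of returning a single TRUE/FALSE. The sketch below reuses my_text2 from the example above; the word-boundary pattern is my own addition, so that longer words containing "two" would not be counted.

```r
library(stringr)

my_text2 <- c("I have two examples of two words in this text.",
              "Isn't having two words fun?")

# One TRUE/FALSE per element: how many texts contain the word
texts_with_word <- sum(str_detect(my_text2, "two"))

# Number of matches per element, summed: how often the word occurs
occurrences <- sum(str_count(my_text2, "\\btwo\\b"))

texts_with_word  # 2
occurrences      # 3
```

Note that both stringr functions are case-sensitive by default, whereas dfm() lowercases its features, so counts on real data can still differ for that reason.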