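I am using the maxent R package on R 3.2.1 for supervised classification of 1,000,000 tweets, 25% of which are held out for testing. Tweet is the predictor and City is the label. The job runs on a CentOS Linux cluster where each core has at least 128 GB of RAM, so memory is not the problem.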
Here is my R code:
library(maxent)
file <- read.csv("JoinedTable.csv")
data <- file[sample(1:3700000,size=1000000,replace=FALSE),]
matrix <- create_matrix(data$Tweet, language="english", stripWhitespace = TRUE, toLower = TRUE, stemWords=FALSE, removePunctuation = TRUE, removeStopwords=TRUE, removeNumbers=TRUE, removeSparseTerms=.998)
sparse2 <- as.compressed.matrix(matrix)
model <- maxent(sparse2[1:750000,],as.factor(data$CIty)[1:750000])
results <- predict(model,sparse2[750001:1000000,])
The predict call returns this error message:
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: .External(list(name = "InternalFunction_invoke", address = <pointer: 0x2a3d5750>, dll = list(name = "Rcpp", path = "/users/40113951/gridware/share/R/3.2.1/Rcpp/libs/Rcpp.so", dynamicLookup = TRUE, handle = <pointer: 0x451c3e90>, info = <pointer: 0x7fe0c5ecb940>), numParameters = -1L), <pointer: 0x42b1aea0>, ...)
2: maximumentropy$classify_samples(as.integer(feature_matrix@dimension[1]), as.integer(feature_matrix@dimension[2]), feature_matrix@ia, ja, feature_matrix@ra, model)
3: classify_maxent(feature_matrix, object@model)
4: predict.maxent(model, sparse2[750001:1e+06, ])
5: predict(model, sparse2[750001:1e+06, ])
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Answer 0 (score: 0)
I found the source of the error: a typo on line 6 of the code, where the label column is written as CIty instead of City.
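To illustrate why a typo like this goes unnoticed until the native code crashes, here is a minimal sketch on toy data. The data frame and the helper require_column are made up for illustration and are not part of maxent. In base R, `$` with a misspelled column name returns NULL rather than raising an error, so the model would be handed labels that are all NA.

```r
# Toy stand-in for the data frame read from JoinedTable.csv (values are invented).
tweets <- data.frame(
  Tweet = c("hello from the office", "stuck in traffic", "great coffee here"),
  City  = c("Belfast", "Dublin", "Belfast"),
  stringsAsFactors = FALSE
)

# `$` with a misspelled (case-sensitive) name returns NULL instead of failing.
bad <- tweets$CIty
is.null(bad)

# Indexing the resulting empty factor pads it with NA rather than raising an error.
bad_labels <- as.factor(bad)[1:3]
bad_labels

# Hypothetical helper: fail fast when a column is missing.
require_column <- function(df, name) {
  if (!(name %in% names(df))) {
    stop("column not found: ", name)
  }
  df[[name]]
}

labels <- as.factor(require_column(tweets, "City"))
nlevels(labels)
```

Calling require_column(tweets, "CIty") stops with an R-level error, which is easier to diagnose than a segfault further down the pipeline.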
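However, this led to another error message (below). I have notified the package maintainer but have not received a response. Does this mean the maxent package cannot handle more than 255 unique labels? The documentation here does not mention any limit on the number of labels.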
[1] "ERROR: Too many types of labels (>255 unique labels)."
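If the limit in that message is real, one way to check for it before training, and one possible workaround, is sketched below. This is only an assumption-laden illustration: cap_labels is a hypothetical helper and not part of maxent, the threshold of 255 is taken from the error text alone, and merging rare cities into a catch-all level changes the classification task, so it may not suit every use case.

```r
# Hypothetical helper: keep the (max_labels - 1) most frequent labels and
# merge everything else into a single catch-all level.
cap_labels <- function(labels, max_labels = 255, other = "Other") {
  labels <- as.character(labels)
  counts <- sort(table(labels), decreasing = TRUE)
  if (length(counts) <= max_labels) {
    return(factor(labels))
  }
  keep <- names(counts)[seq_len(max_labels - 1)]
  labels[!(labels %in% keep)] <- other
  factor(labels)
}

# Toy data standing in for data$City: drawn from 300 distinct labels.
set.seed(1)
toy <- sample(paste0("city", 1:300), size = 5000, replace = TRUE)

length(unique(toy))   # number of distinct labels before capping
capped <- cap_labels(toy)
nlevels(capped)       # no more than max_labels after capping
```

On the real data, length(unique(data$City)) would show whether the label count exceeds 255 in the first place.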