Does anyone have a really well-explained example of getting data into the format that xgboost works with in R?
The get started doc isn't helping me. The data (agaricus.train and agaricus.test) is already in a specialized format (dgCMatrix):
> str(agaricus.train)
List of 2
 $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  .. ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  .. ..@ Dim : int [1:2] 6513 126
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  .. ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()
 $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...
I see that this example code uses sparse.model.matrix, but I am still having a hard time getting fairly simple data into the format xgboost needs.
For example, suppose I have two data frames: words and labels.
The words data frame has a sentence_id and a word_id, with each sentence having one or more words.
The labels data frame has a sentence_id and a label (e.g., 0 or 1 for a binary classification task).
How do I get the data into a format for predicting sentence labels?
I can then separate train and test myself.
Edit: the simplest version uses two data frames, words and data_label.
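Answer 0 (score: 2)
The input to xgb.DMatrix can be a dense matrix, a sparse dgCMatrix, or sparse data stored in a file in LibSVM format. Since you are working with text data, a sparse representation is the most appropriate.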
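As a rough sketch of the first two input types (a dense matrix and a sparse dgCMatrix), assuming the Matrix and xgboost packages are installed. The toy values and the names dense, sparse, and y are made up for illustration and are not part of the question's data:

```r
library(Matrix)
library(xgboost)

# Toy data (made up): 3 rows, 2 features, binary labels.
# matrix() fills column by column, so column 1 is (1, 0, 0) and column 2 is (0, 1, 1).
dense <- matrix(c(1, 0, 0,
                  0, 1, 1), nrow = 3, ncol = 2)
y <- c(0, 1, 1)

# The same values stored sparsely; for this non-square numeric matrix
# Matrix(..., sparse = TRUE) yields a dgCMatrix.
sparse <- Matrix(dense, sparse = TRUE)

# Both representations are accepted as the data argument.
d_dense  <- xgb.DMatrix(data = dense,  label = y)
d_sparse <- xgb.DMatrix(data = sparse, label = y)

dim(d_dense)
dim(d_sparse)
```

Both objects should report the same dimensions; the sparse form simply avoids storing the zeros, which matters once the vocabulary gets large.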
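Below is an example of how to convert your sample data into a dgCMatrix.
Here I assume the ideal case: the sentence_id values are consecutive integers starting from 1 and are the same in both tables. If that is not the case in practice, you will need some more data massaging.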
library(Matrix)
library(xgboost)
words <- data.frame(sentence_id=c(1, 1, 2, 2, 2),
word_id=c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id=c(1, 2), label=c(0, 1))
# quick check of assumptions about sentence_id
stopifnot(min(words$sentence_id) == 1 &&
max(words$sentence_id) == length(unique(words$sentence_id)))
# sparse matrix construction from "triplet" data
# (rows are sentences, columns are words, and the value is always 1)
smat <- sparseMatrix(i = words$sentence_id, j = words$word_id, x = 1)
# make sure sentence_id are in proper order in data_label:
# (note the trailing comma: we reorder rows, not columns)
data_label <- data_label[order(data_label$sentence_id), ]
stopifnot(isTRUE(all.equal(data_label$sentence_id, seq_len(nrow(smat)))))
xmat <- xgb.DMatrix(smat, label = data_label$label)
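
If the sentence_id values are not consecutive integers starting from 1, one possible form of the extra data massaging mentioned above is to map each id to a row index with match(). This is only a sketch under assumptions: the ids 10 and 42 and the helper names ids, row_idx, and labels are hypothetical, and sentences missing from either table are simply dropped.

```r
library(Matrix)

# Hypothetical data: sentence_id values are neither consecutive nor start at 1,
# and data_label is not sorted by sentence_id.
words <- data.frame(sentence_id = c(10, 10, 42, 42, 42),
                    word_id     = c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id = c(42, 10), label = c(1, 0))

# Keep only sentences present in both tables, in a fixed (sorted) order.
ids <- sort(intersect(words$sentence_id, data_label$sentence_id))
words <- words[words$sentence_id %in% ids, ]

# match() gives the position of each original id in ids, i.e. a row index 1..n.
row_idx <- match(words$sentence_id, ids)
smat <- sparseMatrix(i = row_idx, j = words$word_id, x = 1,
                     dims = c(length(ids), max(words$word_id)))

# Labels arranged in the same row order as smat.
labels <- data_label$label[match(ids, data_label$sentence_id)]
```

The resulting smat and labels can then be passed to xgb.DMatrix(smat, label = labels) as in the example above. Note that sparseMatrix sums repeated (i, j) pairs, so a word that occurs twice in a sentence ends up with the value 2 rather than 1.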