Converting data to the format xgboost needs in R?

Time: 2017-04-08 20:46:10

Tags: r xgboost

Does anyone have a really good explanatory example of converting data into a format that xgboost in R can use?

The get started doc did not help me. The data (agaricus.train and agaricus.test) already come in a specialized format (dgCMatrix):

> str(agaricus.train)
List of 2
 $ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  .. ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
  .. ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
  .. ..@ Dim : int [1:2] 6513 126
  .. ..@ Dimnames:List of 2
  .. .. ..$ : NULL
  .. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
  .. ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..@ factors : list()
 $ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...

I see that this example code uses sparse.model.matrix, but I am still having a hard time wrangling fairly simple data into the format xgboost needs.

For example, say I have two data frames: words and labels.

The words data frame has sentence_id and word_id, with one or more words per sentence.

The labels data frame has a sentence_id and a label (e.g., 0 or 1 for a binary classification task).

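How can I get the data into a format suitable for predicting sentence labels?

I can then split it into training and test sets.

Edit: the simplest version of words and data_label: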

data_label

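1 Answer:

Answer 0: (score: 2)

The input to xgb.DMatrix can be a dense matrix, a sparse dgCMatrix, or sparse data stored in a file in LibSVM format. Since you are working with text data, a sparse representation is the most appropriate. Below is an example of how to convert your sample data into a dgCMatrix. Here I assume an ideal case in which the sentence_id values are consecutive integers starting at 1 and are the same in both tables. If that does not hold in practice, you will need some more data munging.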

library(Matrix)   # sparseMatrix()
library(xgboost)  # xgb.DMatrix()

words <- data.frame(sentence_id=c(1, 1, 2, 2, 2),
                    word_id=c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id=c(1, 2), label=c(0, 1))

# quick check of assumptions about sentence_id
stopifnot(min(words$sentence_id) == 1 &&
          max(words$sentence_id) == length(unique(words$sentence_id)))

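# sparse matrix construction from "triplet" data
# (rows are sentences, columns are words, and each entry is given x = 1;
#  sparseMatrix() sums duplicated (i, j) pairs, so a word repeated within
#  a sentence ends up as its count rather than 1)
smat <- sparseMatrix(i = words$sentence_id, j = words$word_id, x = 1)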

# make sure the rows of data_label are sorted by sentence_id
# (note the trailing comma: without it, `[` would select columns, not rows)
data_label <- data_label[order(data_label$sentence_id), ]
stopifnot(isTRUE(all.equal(data_label$sentence_id, 1:nrow(smat))))

xmat <- xgb.DMatrix(smat, label = data_label$label)
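
If the sentence_id values are not consecutive integers starting at 1, one possible form of the extra munging is to map each id to a row index with match(), using the row order of data_label as the reference. The following is only a sketch under assumptions: the ids 10 and 25 and the name row_idx are made up for illustration, and every sentence_id in words is assumed to appear exactly once in data_label.

```r
library(Matrix)

# Hypothetical data: arbitrary, non-consecutive sentence ids,
# with data_label not sorted in any particular order
words <- data.frame(sentence_id = c(10, 10, 25, 25, 25),
                    word_id     = c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id = c(25, 10), label = c(1, 0))

# Map each sentence_id to the position of that id in data_label,
# so that matrix row k corresponds to data_label row k
row_idx <- match(words$sentence_id, data_label$sentence_id)
stopifnot(!anyNA(row_idx))  # every word row must have a labelled sentence

# Fix the dimensions explicitly so that labelled sentences without
# any words would still get an (all-zero) row
smat <- sparseMatrix(i = row_idx, j = words$word_id, x = 1,
                     dims = c(nrow(data_label), max(words$word_id)))

# Build the DMatrix only if xgboost is installed
if (requireNamespace("xgboost", quietly = TRUE)) {
  xmat <- xgboost::xgb.DMatrix(smat, label = data_label$label)
}
```

With this mapping the labels need no re-sorting, because the rows of smat already follow the order of data_label.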