R package mlr multilabel text classification: how to classify new data

Time: 2018-02-12 10:22:53

Tags: r multilabel-classification mlr

I found this code in a tutorial on multilabel classification with the mlr package.

library("mlr")

yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)

lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)

mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))

pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])

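I understand the structure of the yeast data set, but I don't know how to use this code on new data that I want to classify, because at that point I won't have any TRUE or FALSE values for the labels. In practice I will get training data with the same structure as yeast, but in my new data columns 1:14 will be missing. Am I missing something? If not: how do I use the code correctly?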
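To make the situation concrete, here is a minimal sketch of predicting on rows whose label columns have been removed. It assumes that `predict()` with `newdata` only needs the feature columns and simply omits the truth values when no target columns are present; the names `feature.cols`, `new.data` and `pred.new` are made up for this illustration.

```r
library("mlr")

# Train the binary relevance model on the built-in yeast task,
# as in the tutorial snippet.
labels = getTaskTargetNames(yeast.task)
yeast = getTaskData(yeast.task)

lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, yeast.task, subset = 1:1500)

# Simulate unlabeled data: keep only the feature columns.
feature.cols = setdiff(colnames(yeast), labels)
new.data = yeast[1501:1600, feature.cols]

# Predict without any label columns in newdata.
pred.new = predict(mod, newdata = new.data)
head(as.data.frame(pred.new))
```

If that assumption holds, the resulting data frame contains only probability and response columns per label, with no truth columns, so performance measures cannot be computed on it.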

Edit:

Here is example code showing how I would use it:

library("mlr")
library("tm")

train.data = data.frame("id" = c(1,1,2,3,4,4), "text" = c("Monday is nice weather.", "Monday is nice weather.", "Dogs are cute.", "It is very rainy.", "My teacher is angry.", "My teacher is angry."), "label" = c("label1", "label2", "label3", "label1", "label4", "label5"))
test.data = data.frame("id" = c(5,6), "text" = c("Next Monday I will meet my teacher.", "Dogs do not like rain."))

train.data$text = as.character(train.data$text)
train.data$id = as.character(train.data$id)
train.data$label = as.character(train.data$label)
test.data$text = as.character(test.data$text)
test.data$id = as.character(test.data$id)

### Bring training data into structure
train.data$label = make.names(train.data$label)
labels = unique(train.data$label)

# DocumentTermMatrix for all texts
texts = unique(c(train.data$text, test.data$text))
docs <- Corpus(VectorSource(texts))
terms <- DocumentTermMatrix(docs)
m <- as.data.frame(as.matrix(terms))

# Logical columns for labels
test = data.frame("id" = train.data$id, "topic"=train.data$label)
test2 = as.data.frame(unclass(table(test)))
test2[,c(1:ncol(test2))] = as.logical(unlist(test2[,c(1:ncol(test2))]))
rownames(test2) = unique(test$id)

# Bind columns from dtm
termsDf = cbind(test2, m[1:nrow(test2),])
names(termsDf) = make.names(names(termsDf))

### Create Multilabel Task
classify.task = makeMultilabelTask(id = "multi", data = termsDf, target = labels)

### Now the model
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)

### How can I predict for test.data?

So the problem is that I don't have any labels for test.data, because those are exactly what I want to compute. How do I predict them?

EDIT2:

When I simply use

names(m) = make.names(names(m))
pred = predict(mod, newdata = m[(nrow(termsDf)+1):(nrow(termsDf)+nrow(test.data)),])

the result is identical for both texts, and it is not what I expected.
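One possible explanation for the identical results, offered as a sketch rather than a confirmed diagnosis: rpart's default `minsplit` is 20, so a tree fitted on only four training rows is never split and returns the same class proportions for every input. The toy term table below is hypothetical and uses rpart directly instead of going through mlr.

```r
library("rpart")

# Four tiny "documents" as term counts with one binary label
# (hypothetical toy data, not the data from the question).
toy = data.frame(
  label1  = factor(c(TRUE, FALSE, TRUE, FALSE)),
  monday  = c(1, 0, 0, 0),
  dogs    = c(0, 1, 0, 0),
  rainy   = c(0, 0, 1, 0),
  teacher = c(0, 0, 0, 1)
)

fit = rpart(label1 ~ ., data = toy, method = "class")

# Number of nodes in the fitted tree; a single node means root only.
n.nodes = nrow(fit$frame)
print(n.nodes)

# Two new documents with different terms.
new.docs = data.frame(
  monday  = c(1, 0),
  dogs    = c(0, 1),
  rainy   = c(0, 1),
  teacher = c(1, 0)
)

# A root-only tree ignores the features, so both rows get the same output.
probs = predict(fit, newdata = new.docs, type = "prob")
print(probs)
```

If this is the cause, more training documents or a smaller `minsplit` (for example passed as a hyperparameter to `makeLearner("classif.rpart", ...)`) would be worth trying, although with only a handful of texts any tree will be unreliable.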

0 answers:

No answers