How do I plot the correct labels in glmnet?

Time: 2018-06-16 17:36:53

Tags: r glmnet

Consider this example:

library(dplyr)
library(tibble)
library(glmnet)
library(quanteda)

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              'japan'),
                     class = c(1, 1, 1, 1, 1,1,1,1,1,1,1,0,0,0,0,0,0,0,0))

I use quanteda to get a document-term matrix from this data frame:

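# tokenize first: newer quanteda releases expect tokens rather than raw text in dfm()
dtm <- quanteda::dfm(quanteda::tokens(dtrain$text))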
> dtm
Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
19 x 11 sparse Matrix of class "dfm"
        features
docs     chinese beijing shanghai this is china here hello kyoto japan tokyo
  text1        2       1        0    0  0     0    0     0     0     0     0
  text2        2       0        1    0  0     0    0     0     0     0     0
  text3        0       0        0    1  1     1    0     0     0     0     0
  text4        0       0        0    0  1     1    1     0     0     0     0
  text5        0       0        0    0  0     1    0     1     0     0     0

I can easily fit a lasso regression with glmnet:

fit <- glmnet(dtm, y = as.factor(dtrain$class), alpha = 1, family = 'binomial')

However, plotting fit does not show the labels from the dtm matrix (I only see three curves). What is wrong here?

1 answer:

Answer 0: (score: 2)

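As far as I can tell, the plot shows the coefficient paths of the words that get nonzero coefficients. In your case these appear to be features 9 to 11, i.e. kyoto, japan and tokyo (going by the column order in the dtm output), which is why you only see three curves. The default plot method does not label the curves with feature names; with label = TRUE it only prints the column index next to each curve. Instead, you can use plot_glmnet() from the plotmo package, like this: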

library(dplyr)
library(tibble)
library(glmnet)
library(quanteda)
library(plotmo)
dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              'japan'),
                     class = c(1, 1, 1, 1, 1,1,1,1,1,1,1,0,0,0,0,0,0,0,0))


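# tokenize first: newer quanteda releases expect tokens rather than raw text in dfm()
dtm <- quanteda::dfm(quanteda::tokens(dtrain$text))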
fit <- glmnet(dtm, y = as.factor(dtrain$class), alpha = 1, family = 'binomial')
plot_glmnet(fit, label=3)            # label the 3 biggest final coefs
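
If you would rather not add plotmo, here is a rough base-graphics sketch of the same idea. It is only an illustration under some assumptions: I rebuild the term matrix with base R so it does not depend on the quanteda version, and the names texts, vocab, x, beta_path and nonzero are mine, not part of any package. It lists the features that have a nonzero coefficient at the smallest lambda on the path and writes their names next to the curves.

```r
library(glmnet)

# Same toy corpus and class vector as above, written compactly
texts <- c(rep(c("Chinese Beijing Chinese", "Chinese Chinese Shanghai",
                 "this is china", "china is here", "hello china"), 2),
           rep(c("Kyoto Japan", "Tokyo Japan Chinese"), 4),
           "japan")
y <- c(rep(1, 11), rep(0, 8))

# Base-R term counts: lowercase, split on spaces, count each vocabulary word
tokens <- strsplit(tolower(texts), " ", fixed = TRUE)
vocab  <- unique(unlist(tokens))
x <- t(vapply(tokens,
              function(tk) as.numeric(table(factor(tk, levels = vocab))),
              numeric(length(vocab))))
colnames(x) <- vocab

fit <- glmnet(x, as.factor(y), alpha = 1, family = "binomial")

# Coefficient paths: one row per feature, one column per lambda
beta_path <- as.matrix(fit$beta)
last      <- ncol(beta_path)          # column of the smallest lambda

# Features whose coefficient is nonzero at the end of the path
nonzero <- rownames(beta_path)[beta_path[, last] != 0]
print(nonzero)

# Draw the paths ourselves and label the nonzero ones by name
matplot(log(fit$lambda), t(beta_path), type = "l", lty = 1,
        xlab = "log(lambda)", ylab = "Coefficient")
if (length(nonzero) > 0) {
  text(x = log(fit$lambda[last]), y = beta_path[nonzero, last],
       labels = nonzero, pos = 4, xpd = TRUE, cex = 0.8)
}
```

Printing nonzero is a quick way to check which words the visible curves belong to; every feature not in that list sits flat at zero, so its curve is hidden on the axis.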

I hope the resulting plot is what you were asking for. Cheers!