Question

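I am using R, and I have a file of blog posts with 3 variables: title (the title of the post), body (the text of the post), and great (1 if the post received five stars, 0 otherwise). I have created a corpus for the title, removed punctuation, converted it to lowercase, and so on, as follows: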

title = Corpus(VectorSource(posts$title))
title = tm_map(title, tolower)
title = tm_map(title, PlainTextDocument)
title = tm_map(title, removePunctuation)
title = tm_map(title, removeWords, stopwords("english"))
title = tm_map(title, stemDocument)
dtm = DocumentTermMatrix(title)
sparseTerms = removeSparseTerms(dtm, 0.99)
title = as.data.frame(as.matrix(sparseTerms))
title$great = posts$great

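I followed the same process for the body variable. After that, I used sample.split to separate the training and test sets (for the title) and the glm() function to fit a logistic regression: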

library(caTools)
spl = sample.split(title$great, 0.7)
train = subset(title, spl == TRUE)
test = subset(title, spl == FALSE)
Log = glm(great ~ ., data=train, family=binomial)
summary(Log)

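Now, when I use summary(), I can see which variables are significant (3 stars). And, as you can see, I am only using the title variables in glm(). So I would like to know:

How can I keep only the variables with 3 stars?
How can I combine the title and body corpora so that I can use the glm() function with all of the data?

Thanks in advance.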

Answer 1

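If you look closely at the object returned by summary(...)$coefficients, you will see that it is just a matrix, with the intercept and predictor variables as row names and Estimate, Std. Error, z value and Pr(>|z|) as column names. The column you are interested in is the last (fourth) one. Using the name you gave your logistic model in your code:

# Get the coefficient matrix
coefs <- summary(Log)$coefficients
# Identify the variables with "3 stars"
vars <- rownames(coefs)[which(coefs[, 4] < 0.001)]

As for your second question: a simple way to combine two (small to moderately sized) data frames in the way you describe is to bind them column by column, for example with cbind().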
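The two steps can be put together in a minimal sketch. It assumes the title and body data frames have one row per post in the same order. The toy data, the `title_`/`body_` prefixes, and the names `title_df`, `body_df`, `combined`, `full_fit`, `keep` and `reduced_fit` are all hypothetical and stand in for the asker's real objects. The prefixes are there because the same stemmed term can appear in both the title and the body, which would otherwise give duplicate column names.

```r
# Toy stand-ins for the two term data frames (hypothetical data, not the real posts).
set.seed(42)
n <- 200
title_df <- data.frame(data = rbinom(n, 1, 0.3), model = rbinom(n, 1, 0.4))
body_df  <- data.frame(data = rbinom(n, 1, 0.5), plot  = rbinom(n, 1, 0.2))

# Prefix the column names so a term present in both sources stays distinct.
names(title_df) <- paste0("title_", names(title_df))
names(body_df)  <- paste0("body_",  names(body_df))

# Bind the predictors column-wise; rows must describe the same posts in the same order.
combined <- cbind(title_df, body_df)

# Simulated outcome for the sketch; with real data this would be posts$great.
combined$great <- rbinom(n, 1, plogis(-1 + 2 * combined$title_data))

# Fit the logistic regression on all title and body terms.
full_fit <- glm(great ~ ., data = combined, family = binomial)
coefs <- summary(full_fit)$coefficients

# Names of the predictors with p < 0.001 ("3 stars"), excluding the intercept.
keep <- setdiff(rownames(coefs)[coefs[, 4] < 0.001], "(Intercept)")

# Refit using only those predictors, if any passed the threshold.
reduced_fit <- if (length(keep) > 0) {
  glm(reformulate(keep, response = "great"), data = combined, family = binomial)
} else {
  NULL
}
```

If the real column names are not syntactic R names, the row names of the coefficient matrix may carry backticks, so it is worth checking `keep` against `names(combined)` before refitting.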
