predict() with newdata whose two columns have different numbers of rows

Time: 2018-07-10 09:42:40

Tags: r multiple-columns prediction glm poisson

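I am trying to make predictions from three variables (retweets, media, content) in a dataset (df_22), so that I can choose between a Poisson, a negative binomial, and a zero-inflated Poisson model. One of the three variables is the response (retweets); the other two are the predictors (media, content).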

I fitted the generalized linear models without any problem.

Zero-inflated Poisson

 library("pscl")
 summary( m0 <- zeroinfl(retweets ~ media + content, data=df_22,dist="poisson") )

Poisson

summary( m1 <- glm(formula=retweets ~ media + content, data=df_22, family=poisson(link="log")))

Negative binomial

library (MASS) 
summary( m2 <- glm.nb(retweets ~ media + content, data=df_22) )

However, I run into trouble when I create the new data frame for prediction. First I check the factor levels:

> levels(df_22$media)
[1] "other" "pic"   "pw"    "text"  "web"

> levels(df_22$content)
[1] "cultura"     "employ"      "environment" "other"       "security"    "sport"       "transport" 

This is where the problem appears: the two columns have different numbers of rows.

newmedia = c("other","pic","pw","text", "web")
newcontent = c("cultura","employ","environment","other","security","sport","transport")

nd = data.frame(media = newmedia, content = newcontent)
  

Error in data.frame(media = newmedia, content = newcontent) : arguments imply differing number of rows: 5, 7

How can I solve this?

I want to solve it so that I can make the predictions below and then choose which of the three models fits my data best.

p0 <- cbind(nd, Count = predict(m0, newdata = nd, type = "count"), Zero = predict(m0, newdata = nd, type = "zero"))

p1 <- cbind(nd, Mean = predict(m1, newdata = nd, type="response"), SE = predict(m1, newdata = nd, type="response", se.fit=T)$se.fit)

p2 <- cbind(nd, Mean = predict(m2, newdata = nd, type="response"), SE = predict(m2, newdata = nd, type="response", se.fit=T)$se.fit)

1 Answer:

Answer 0: (score: 0)

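The code below creates a sample dataset and computes p0, p1, and p2. Here the nd data frame is created differently from the question: it is built as a test data frame.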

Import the libraries

library(pscl)
library (MASS)

Create a sample dataset

media <- c("other", "pic",   "pw",    "text",  "web")
content <- c("cultura", "employ", "environment", "other", "security", "sport", "transport")

set.seed(1)
retweets <- floor(abs(1e4*rnorm(1000)))
temp_index <- which(retweets %in% sample(retweets, 20)) # sample indexes
retweets[temp_index] <- 0 # set some retweets to zero to run zeroinfl()
df <- data.frame(retweets)
df$media <- sample(media, 1000, replace = TRUE)
df$content <- sample(content, 1000, replace = TRUE)
head(df)

unique(df$media)
unique(df$content)

Create the test dataset

Note: here the test dataset is drawn from the training data for illustration only. Ideally it should be a new set of data.

nd <- df[sample(nrow(df), 300), ] # ideally this should not be from the train data, this is just for an example code
nd_X <- nd[, c('media', 'content')] # predictors only
nd_Y <- nd[, 'retweets']            # observed response for the same rows
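The nd above is a sample of observed rows. If the goal is instead one prediction for every combination of media and content, which is what the question's data.frame() call was attempting, one possible sketch uses base R's expand.grid(). The name nd_grid is hypothetical and not part of the original answer; the levels are copied from the question.

```r
# Levels as listed in the question
media_levels   <- c("other", "pic", "pw", "text", "web")
content_levels <- c("cultura", "employ", "environment", "other",
                    "security", "sport", "transport")

# data.frame() needs columns of equal length, so 5 vs 7 values fails.
# expand.grid() instead builds every media/content combination (5 * 7 rows).
# nd_grid is an illustrative name, not from the original post.
nd_grid <- expand.grid(media = media_levels,
                       content = content_levels,
                       stringsAsFactors = FALSE)

dim(nd_grid)
head(nd_grid)
```

A data frame built this way has the two columns the model formulas expect, so it could be passed as newdata to predict() in place of nd.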

Fit the models: zeroinfl(dist='poisson'), glm(family='poisson'), glm.nb()

# Zero-inflated Poisson
summary( m0 <- zeroinfl(retweets ~ media + content, data=df, dist="poisson") )

# Poisson
summary( m1 <- glm(formula=retweets ~ media + content, data=df, family=poisson(link="log")))

# Negative binomial
#summary( m2 <- glm.nb(retweets ~ media + content, data=df) )  # gives error in summary due to zeros
summary( m2 <- glm.nb(retweets ~ media + content, data=df[df$retweets!=0,]) ) # no error without zeros

Predict using the test data

p0 <- cbind(nd, Count = predict(m0, newdata = nd_X, type = "count"), Zero = predict(m0, newdata = nd_X, type = "zero"))
p1 <- cbind(nd, Mean = predict(m1, newdata = nd_X, type="response"), SE = predict(m1, newdata = nd_X, type="response", se.fit=TRUE)$se.fit)
p2 <- cbind(nd, Mean = predict(m2, newdata = nd_X, type="response"), SE = predict(m2, newdata = nd_X, type="response", se.fit=TRUE)$se.fit)

Output:

[image not available]
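The answer extracts the observed responses into nd_Y but never uses them. One way they could help with choosing a model is to compare held-out predictions against the observed counts. The following is only a minimal sketch under stated assumptions: it uses simulated Poisson counts rather than the question's data, fits a plain Poisson glm() only, and the names train, test, fit, pred, and rmse are illustrative. The same error measure could be computed for each candidate model and compared.

```r
set.seed(1)

media   <- c("other", "pic", "pw", "text", "web")
content <- c("cultura", "employ", "environment", "other",
             "security", "sport", "transport")

# Simulated data (an assumption for illustration, not the real df_22)
df <- data.frame(
  retweets = rpois(1000, lambda = 5),
  media    = sample(media, 1000, replace = TRUE),
  content  = sample(content, 1000, replace = TRUE)
)

# Hold out 300 rows that the model never sees during fitting
test_idx <- sample(nrow(df), 300)
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# Fit on the training rows only
fit <- glm(retweets ~ media + content, data = train,
           family = poisson(link = "log"))

# Predict expected counts for the held-out rows
pred <- predict(fit, newdata = test[, c("media", "content")],
                type = "response")

# Root mean squared error between observed and predicted counts
rmse <- sqrt(mean((test$retweets - pred)^2))
rmse
```

A lower held-out error suggests better predictive performance, although for count models likelihood-based criteria are also commonly considered.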