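I am trying to make predictions from three variables (retweets, media, content) in a data set (df_22) so that I can choose between a Poisson, a negative binomial, and a zero-inflated Poisson model. One of the three variables is the response (retweets); the other two are predictors (media, content).
I fitted the generalized linear models without any problems.
Zero-inflated Poisson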
library("pscl")
summary( m0 <- zeroinfl(retweets ~ media + content, data=df_22,dist="poisson") )
Poisson
summary( m1 <- glm(formula=retweets ~ media + content, data=df_22, family=poisson(link="log")) )
Negative binomial
library(MASS)
summary( m2 <- glm.nb(retweets ~ media + content, data=df_22) )
However, the trouble starts when I create the new data for prediction. First I check the factor levels:
> levels(df_22$media)
[1] "other" "pic" "pw" "text" "web"
> levels(df_22$content)
[1] "cultura" "employ" "environment" "other" "security" "sport" "transport"
Here is the problem: the two columns have different numbers of rows.
newmedia = c("other","pic","pw","text", "web")
newcontent = c("cultura","employ","environment","other","security","sport","transport")
nd = data.frame(media = newmedia, content = newcontent)
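Error in data.frame(media = newmedia, content = newcontent) : arguments imply differing number of rows: 5, 7
How should I solve this?
I want to fix it so that I can make these predictions and then choose which of the three models fits my data best.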
p0 <- cbind(nd, Count = predict(m0, newdata = nd, type = "count"), Zero = predict(m0, newdata = nd, type = "zero"))
p1 <- cbind(nd, Mean = predict(m1, newdata = nd, type="response"), SE = predict(m1, newdata = nd, type="response", se.fit=T)$se.fit)
p2 <- cbind(nd, Mean = predict(m2, newdata = nd, type="response"), SE = predict(m2, newdata = nd, type="response", se.fit=T)$se.fit)
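Answer 0 (score: 0)
The code below creates a test data set and computes p0, p1, and p2 on it. Note that the nd data frame is built differently here than in the question: it is sampled from rows of the data rather than assembled from the level vectors.
Load the libraries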
library(pscl)
library (MASS)
Create a sample data set
media <- c("other", "pic", "pw", "text", "web")
content <- c("cultura", "employ", "environment", "other", "security", "sport", "transport")
set.seed(1)
retweets <- floor(abs(1e4*rnorm(1000)))
temp_index <- which(retweets %in% sample(retweets, 20)) # sample indexes
retweets[temp_index] <- 0 # set some retweets to zero to run zeroinfl()
df <- data.frame(retweets)
df$media <- sample(media, 1000, replace = TRUE)
df$content <- sample(content, 1000, replace = TRUE)
head(df)
unique(df$media)
unique(df$content)
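Before fitting anything, a quick look at the share of zeros and at the variance-to-mean ratio can hint at which count model is plausible, since a Poisson model assumes that the variance equals the mean. The following is only a minimal sketch that rebuilds the simulated retweets vector from above; the names zero_share and dispersion_ratio are illustrative and not part of the original answer.

```r
# Rebuild the simulated response exactly as in the sample data set above
set.seed(1)
retweets <- floor(abs(1e4 * rnorm(1000)))
retweets[which(retweets %in% sample(retweets, 20))] <- 0

# Fraction of observations that are exactly zero
zero_share <- mean(retweets == 0)

# Variance-to-mean ratio; values far above 1 indicate overdispersion
dispersion_ratio <- var(retweets) / mean(retweets)

zero_share
dispersion_ratio
```

A ratio well above 1 suggests that a plain Poisson fit is likely to understate the variability, which is one reason to consider the negative binomial alternative.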
Create the test data set
Note: the test data set here is drawn from the training data purely for illustration. Ideally it should be a separate set of new data.
nd = df[sample(nrow(df), 300), ] # ideally this should not be from the train data, this is just for an example code
nd_X <- nd[,c('media', 'content')]
nd_Y <- nd[,c('retweets')]
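The error in the question arises because data.frame() needs columns of equal length, while media has 5 levels and content has 7. If the aim is one prediction per combination of levels, one possible alternative to sampling rows is expand.grid(), which builds the full cross product. This is a sketch under that assumption; the name nd_grid is hypothetical.

```r
# Level vectors copied from the question
newmedia   <- c("other", "pic", "pw", "text", "web")
newcontent <- c("cultura", "employ", "environment", "other",
                "security", "sport", "transport")

# One row per (media, content) combination: 5 * 7 = 35 rows
nd_grid <- expand.grid(media = newmedia, content = newcontent,
                       stringsAsFactors = TRUE)

dim(nd_grid)
head(nd_grid)
```

A grid like this could then be passed as newdata to predict() in place of nd, provided its levels match those the models were fitted with.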
Fit the models: zeroinfl(dist='poisson'), glm(family='poisson'), glm.nb()
# Zero-inflated Poisson
summary( m0 <- zeroinfl(retweets ~ media + content, data=df, dist="poisson") )
# Poisson
summary( m1 <- glm(formula=retweets ~ media + content, data=df, family=poisson(link="log")) )
# Negative binomial
#summary( m2 <- glm.nb(retweets ~ media + content, data=df) ) # gives error in summary due to zeros
summary( m2 <- glm.nb(retweets ~ media + content, data=df[df$retweets!=0,]) ) # no error without zeros
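Since the goal in the question is to choose among the three models, one common option is to compare an information criterion such as AIC. Such a comparison is only meaningful when every model is fitted to the same rows, so a model fitted after dropping the zeros would not be comparable with the other two. The sketch below uses a small simulated data set (toy, with made-up coefficients) rather than the data above; all names in it are illustrative assumptions.

```r
library(MASS)  # glm.nb()
library(pscl)  # zeroinfl()

# Simulated overdispersed counts with some extra zeros (illustration only)
set.seed(42)
n <- 500
toy <- data.frame(
  media   = factor(sample(c("pic", "text", "web"), n, replace = TRUE)),
  content = factor(sample(c("sport", "other"), n, replace = TRUE))
)
mu <- exp(1 + 0.5 * (toy$media == "pic") - 0.3 * (toy$content == "sport"))
toy$retweets <- rnbinom(n, size = 1.5, mu = mu)
toy$retweets[runif(n) < 0.15] <- 0

# Fit all three candidates on the same rows
fit_pois <- glm(retweets ~ media + content, data = toy,
                family = poisson(link = "log"))
fit_nb   <- glm.nb(retweets ~ media + content, data = toy)
fit_zip  <- zeroinfl(retweets ~ media + content, data = toy,
                     dist = "poisson")

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(fit_pois, fit_nb, fit_zip)
aic_table[order(aic_table$AIC), ]
```

AIC is only one criterion; comparing predictions on held-out data is another reasonable check.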
Predict using the test data set
p0 <- cbind(nd, Count = predict(m0, newdata = nd_X, type = "count"), Zero = predict(m0, newdata = nd, type = "zero"))
p1 <- cbind(nd, Mean = predict(m1, newdata = nd_X, type="response"), SE = predict(m1, newdata = nd, type="response", se.fit=T)$se.fit)
p2 <- cbind(nd, Mean = predict(m2, newdata = nd_X, type="response"), SE = predict(m2, newdata = nd, type="response", se.fit=T)$se.fit)
Output: