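I am trying to capture the salience, or dominance, of a topic within a document. My measure of salience is the number of words devoted to that topic. However, I need to control for the fact that documents differ in length (TOTAL_WORDS: mean = 2,444 words, SD = 1,379 words, min = 561, max = 8,342, range = 7,781 words). If I use a negative binomial model (glm.nb), should Total_Words be an offset or a weight? Second, if I use Total_Words as an offset, should it be entered as the log of Total_Words, as in Poisson regression?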
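I tried fitting the model with an offset and with weights, and the results are very different: my coefficients are statistically significant only when I use weights. I looked at the package documentation, which says: "For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes". Does this mean that weights are acceptable in my case?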
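To make the second question concrete, this is how I understand an offset to enter a log-link count model. The symbols are my own shorthand and not from any documentation: $Y_i$ is the topic word count (Problem_Demand) for document $i$, $N_i$ is its total word count, and $x_i$ stands for the HEALTH_CJ dummies.

```latex
\begin{align}
\log \mathbb{E}[Y_i] &= \beta_0 + \beta_1 x_i + \log N_i \\
\Longleftrightarrow\quad
\log \frac{\mathbb{E}[Y_i]}{N_i} &= \beta_0 + \beta_1 x_i \\
\Longleftrightarrow\quad
\frac{\mathbb{E}[Y_i]}{N_i} &= \exp(\beta_0 + \beta_1 x_i)
\end{align}
```

If that is right, putting $\log N_i$ on the right-hand side with its coefficient fixed at 1 turns the model into one for topic words per total word, which is the quantity I care about. Entering $N_i$ unlogged would instead multiply the expected count by $\exp(N_i)$.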
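library(MASS)  # provides glm.nb
# Model 1: total words enter as a log offset
summary(m1 <- glm.nb(Problem_Demand ~ HEALTH_CJ + offset(log(`TOTAL WORDS`))))
# Model 2: total words enter as prior weights
summary(m2 <- glm.nb(Problem_Demand ~ HEALTH_CJ, weights=Dissertation_Dataset$`TOTAL WORDS`))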
Offset results:
Call:
glm.nb(formula = Problem_Demand ~ HEALTH_CJ +
offset(log(`TOTAL WORDS`)), init.theta = 0.1490825725,
link = log)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.55538  -1.41229  -0.45314   0.00276   1.87925
Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)               -2.5384     0.2897  -8.762   <2e-16
HEALTH_CJLaw Enforcement  -0.6883     0.4796  -1.435    0.151
HEALTH_CJOther             0.3187     0.6031   0.529    0.597
(Dispersion parameter for Negative Binomial(0.1491) family taken to be 1)
Null deviance: 154.04 on 149 degrees of freedom
Residual deviance: 151.23 on 147 degrees of freedom
AIC: 1400
Number of Fisher Scoring iterations: 1
Theta: 0.1491
Std. Err.: 0.0183
2 x log-likelihood: -1391.9620
Weights results:
Call:
glm.nb(formula = Problem_Demand ~ HEALTH_CJ,
weights = `TOTAL WORDS`, init.theta = 0.1458893113,
link = log)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-121.467   -62.381   -21.260    -3.179   108.458
Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)               5.297791   0.005737  923.48   <2e-16
HEALTH_CJLaw Enforcement -1.163340   0.009350 -124.42   <2e-16
HEALTH_CJOther            0.529726   0.014012   37.81   <2e-16
(Dispersion parameter for Negative Binomial(0.1459) family taken to be 1)
Null deviance: 391806 on 149 degrees of freedom
Residual deviance: 373685 on 147 degrees of freedom
AIC: 3483728
Number of Fisher Scoring iterations: 1
Theta: 0.145889
Std. Err.: 0.000362
2 x log-likelihood: -3483720.172000
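The standard errors in the two outputs above differ by a factor of roughly 50, which is close to the square root of the mean document length. To check whether that gap comes from the weights themselves rather than from anything in my data, a small simulation along the following lines might help. This is only a sketch: every name and value in it (total_words, group, the per-word rates, size = 1) is made up for illustration and is not taken from my dataset.

```r
library(MASS)  # provides glm.nb

set.seed(1)
n <- 150

# Hypothetical documents: lengths span the same range as in the question
total_words <- round(runif(n, 561, 8342))
group <- factor(sample(c("Health", "Law Enforcement", "Other"), n, replace = TRUE))

# Assumed per-word topic rates for each group (invented values)
rate <- c(Health = 0.08, `Law Enforcement` = 0.04, Other = 0.11)[as.character(group)]

# Overdispersed topic word counts whose mean is proportional to document length
topic_words <- rnbinom(n, mu = rate * total_words, size = 1)

# Offset: models topic words per total word; coefficient of log(total_words) is fixed at 1
m_offset <- glm.nb(topic_words ~ group + offset(log(total_words)))

# Weights: models the raw count, with each document's likelihood
# contribution multiplied by its length
m_weight <- glm.nb(topic_words ~ group, weights = total_words)

# Standard errors from each fit and their ratio
se_offset <- coef(summary(m_offset))[, "Std. Error"]
se_weight <- coef(summary(m_weight))[, "Std. Error"]
print(se_offset / se_weight)
```

If the ratio is also large here, where the true rates are known, that would suggest the weighted fit is behaving as if there were thousands of copies of each document rather than 150 documents, and that its tiny p-values reflect the weighting and not real evidence.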