How to set up data to use an interaction term in a daily time series

Asked: 2015-03-23 15:33:12

Tags: r interaction

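I have a daily time series of all-cause deaths, stratified by disease category, and I would like to know whether the association between deaths and pm10 is modified by disease category. In the example data, death is the total daily deaths, cvd is deaths from heart disease, and "others" is all deaths not caused by heart disease. To model the association between pm10 and each outcome, I use the following script.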

m1<-gam(death ~ pm10 + s(trend)+ s(temp), data=df1, na.action=na.omit, family=poisson)
m2<-gam(cvd ~ pm10 + s(trend)+ s(temp), data=df1, na.action=na.omit, family=poisson)
m3<-gam(others ~ pm10 + s(trend)+ s(temp), data=df1, na.action=na.omit, family=poisson)

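Every day has a death count, broken down by cause of death. On 1 January 1987, for example, there were 130 deaths (65 from CVD and 65 from other causes). My aim is to determine whether the association between PM10 exposure and death differs between the CVD group and the other-causes group. The research question is: does the mortality response to PM10 exposure differ between CVD and other causes? In a stratified analysis I can split the data into a CVD group and an others group, but for this task I am interested in running a model with an interaction term, and I can't figure out how to do that.

My idea is to duplicate each row and create a dummy variable for the two groups (1 for others, 0 for CVD) plus a single column (newdeath), so that each day has two rows: one holding the deaths from other causes and one holding the deaths from CVD. With that setup (dataset df2, shown below), I would like to run the following code: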

minter<-gam(newdeath~ pm10*dummy  + s(trend)+ s(temp), data=df2, na.action=na.omit, family=poisson)

However, I am not sure whether this data layout and model will really let me achieve what I want.
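One way to see what the pm10*dummy term estimates is to compare it with stratified fits. The following is only a sketch on simulated data: the rates, the slopes and the names sim, long and fit_* are made up for illustration, and plain glm() is used without the smooth terms so that it runs with base R alone.

```r
set.seed(1)
n <- 300
pm10 <- runif(n, 10, 60)

# Hypothetical log-rate slopes, chosen only for illustration
sim <- data.frame(
  pm10   = pm10,
  cvd    = rpois(n, exp(4.0 + 0.002 * pm10)),
  others = rpois(n, exp(4.1 + 0.001 * pm10))
)

# Stack the two outcomes: dummy = 1 for others, 0 for CVD
long <- rbind(
  data.frame(pm10 = sim$pm10, newdeath = sim$others, dummy = 1),
  data.frame(pm10 = sim$pm10, newdeath = sim$cvd,    dummy = 0)
)

# Stratified models, one per cause
fit_cvd    <- glm(cvd ~ pm10,    family = poisson, data = sim)
fit_others <- glm(others ~ pm10, family = poisson, data = sim)

# Single model with an interaction term on the stacked data
fit_inter <- glm(newdeath ~ pm10 * dummy, family = poisson, data = long)

# The interaction coefficient is the difference between the stratified slopes
slope_diff <- unname(coef(fit_others)["pm10"] - coef(fit_cvd)["pm10"])
inter_coef <- unname(coef(fit_inter)["pm10:dummy"])
```

Here the pm10:dummy coefficient reproduces the difference between the two stratified pm10 slopes on the log scale, and its standard error gives a direct test of that difference. The match is exact only because every term is interacted with dummy. In minter, s(trend) and s(temp) are shared by both causes, so its estimates would generally differ from those of the stratified models m2 and m3.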

The following code generates the sample dataset:

library(mgcv)
library(dlnm)  # provides the chicagoNMMAPS dataset
df <- chicagoNMMAPS
df1 <- df[, c("date", "dow", "death", "cvd", "temp", "pm10")]
df1$trend <- seq_len(nrow(df1))      # day index used as the long-term trend
df1$others <- df1$death - df1$cvd    # all other non-CVD deaths

I have considered setting up the data as follows to solve the problem, but I am not sure whether it is correct.

> dput(df2)
structure(list(date = structure(c(6209, 6209, 6210, 6210, 6211, 
6211, 6212, 6212, 6213, 6213), class = "Date"), dow = structure(c(5L, 
5L, 6L, 6L, 7L, 7L, 1L, 1L, 2L, 2L), .Label = c("Sunday", "Monday", 
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"), class = "factor"), 
    death = c(130L, 130L, 150L, 150L, 101L, 101L, 135L, 135L, 
    126L, 126L), cvd = c(65L, 65L, 73L, 73L, 43L, 43L, 72L, 72L, 
    64L, 64L), temp = c(-0.277777777777778, -0.277777777777778, 
    0.555555555555556, 0.555555555555556, 0.555555555555556, 
    0.555555555555556, -1.66666666666667, -1.66666666666667, 
    0, 0), pm10 = c(26.956073733, 26.956073733, NA, NA, 32.838694951, 
    32.838694951, 39.9560737332, 39.9560737332, NA, NA), trend = c(1L, 
    1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L), newdeath = c(65L, 65L, 
    77L, 73L, 58L, 43L, 63L, 72L, 62L, 64L), dummy = c(1, 0, 
    1, 0, 1, 0, 1, 0, 1, 0)), datalabel = "Written by R.              ", time.stamp = "24 Mar 2015 00:00", .Names = c("date", 
"dow", "death", "cvd", "temp", "pm10", "trend", "newdeath", "dummy"
), formats = c("%dD_m_Y", "%9.0g", "%9.0g", "%9.0g", "%9.0g", 
"%9.0g", "%9.0g", "%9.0g", "%9.0g"), types = c(255L, 253L, 253L, 
253L, 255L, 255L, 253L, 253L, 254L), val.labels = c("", "dow", 
"", "", "", "", "", "", ""), var.labels = c("date", "dow", "death", 
"cvd", "temp", "pm10", "trend", "others", ""), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"), version = 12L, label.table = structure(list(
    dow = structure(1:7, .Names = c("Sunday", "Monday", "Tuesday", 
    "Wednesday", "Thursday", "Friday", "Saturday"))), .Names = "dow"), class = "data.frame")
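A layout like df2 could be produced by stacking two copies of the daily data, one per cause. This is a minimal sketch that uses only the first five days copied from the dput() output above, so it runs without dlnm. The names df1_small and df2_sketch are hypothetical.

```r
# First five days, values copied from the dput() output above
df1_small <- data.frame(
  date  = as.Date(c(6209, 6210, 6211, 6212, 6213), origin = "1970-01-01"),
  death = c(130L, 150L, 101L, 135L, 126L),
  cvd   = c(65L, 73L, 43L, 72L, 64L),
  temp  = c(-0.277777777777778, 0.555555555555556, 0.555555555555556,
            -1.66666666666667, 0),
  pm10  = c(26.956073733, NA, 32.838694951, 39.9560737332, NA),
  trend = 1:5
)
df1_small$others <- df1_small$death - df1_small$cvd

# One copy of every day per cause: dummy = 1 for others, 0 for CVD
others_rows <- transform(df1_small, newdeath = others, dummy = 1)
cvd_rows    <- transform(df1_small, newdeath = cvd,    dummy = 0)

# Stack the copies and sort so each day's two rows sit together
df2_sketch <- rbind(others_rows, cvd_rows)
df2_sketch <- df2_sketch[order(df2_sketch$trend, -df2_sketch$dummy), ]
rownames(df2_sketch) <- NULL
```

The same two transform() calls followed by rbind() should work on the full df1, since they rely only on the cvd and others columns already being present.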

1 Answer:

Answer 0 (score: 0)

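I agree with @BondedDust ... it seems your data are too aggregated to answer your question, at least with the tools you want to use. What you can measure, however, is the relationship between pm10 and the proportion of deaths due to heart disease: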

df1$prop.cvd <- df1$cvd / df1$death

Then visualize it:

plot(prop.cvd ~ pm10, data = df1)

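You could then use this variable as the response in a regression model, as below. Note that this answers a different question from yours, and it does not account for any lag in the effects of pm10 or for time effects. For that you would need other tools, which I can't help you with here. Asking on Cross-Validated might get you further.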

model <- glm(prop.cvd ~ pm10 + temp, data = df1)
summary(model)

Call:
glm(formula = prop.cvd ~ pm10 + temp, data = df1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.1632  -0.0351  -0.0014   0.0349   0.3323  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.435e-01  1.529e-03 290.051   <2e-16 ***
pm10         5.294e-05  4.161e-05   1.272    0.203    
temp        -6.436e-04  7.437e-05  -8.654   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.002732995)

    Null deviance: 13.498  on 4862  degrees of freedom
Residual deviance: 13.282  on 4860  degrees of freedom
  (251 observations deleted due to missingness)
AIC: -14898

Number of Fisher Scoring iterations: 2
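As a variation on the same idea, the proportion could be modelled with a binomial GLM on the counts themselves, which weights each day by its number of deaths. This is a swapped-in technique rather than the model fitted above, and the sketch runs on simulated data: the rates, the 0.44 CVD share and the name sim are assumptions made for illustration.

```r
set.seed(42)
n <- 200

# Simulated daily data; all values are hypothetical
sim <- data.frame(
  pm10 = runif(n, 10, 60),
  temp = rnorm(n, mean = 10, sd = 10)
)
sim$death  <- rpois(n, 115) + 1L                        # at least one death per day
sim$cvd    <- rbinom(n, size = sim$death, prob = 0.44)  # CVD share of deaths
sim$others <- sim$death - sim$cvd

# Two-column response: successes (cvd) and failures (others)
fit_bin <- glm(cbind(cvd, others) ~ pm10 + temp, family = binomial, data = sim)
summary(fit_bin)
```

The pm10 coefficient here is on the log-odds scale, so it describes how the odds of a death being from CVD rather than another cause change with pm10. Like the model above, it ignores lags and time trends.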