Question

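I have a panel of data with year, country, and firm identifiers. I would like to fit a logit model to each year-country subset using data.table. I have no problems if every year-country subset has enough entries to fit the model, but if a year-country subset does not have enough data, glm throws an error and I can't fit all the models. (I get basically the same error with lm.)

Is there a solution within data.table? Or should I modify my data upstream to make sure no year-country subset has too little data?

Thanks!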

library(data.table)

# similar data
DT <- data.table(year=rep(2001:2010, each=100),
                 country=rep(rep(1:10, each=10), 10), 
                 firm=rep(1:100, 10), 
                 y=round(runif(100)), 
                 x=runif(100)
                 )
setkey(DT, year, country)

# no problems if there are enough data per year-country subset
DT2 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]

# but `glm` throws an error if a subset has only missing data
DT[(DT$year == 2001) & (DT$country == 1), "y"] <- NA
DT3 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]

This yields:

> DT3 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]
Error in family$linkfun(mustart) : 
  Argument mu must be a nonempty numeric vector

Answer 1

How about this?

fn <- function(x, y) {
  if (length(na.omit(y)) == 0)
    NULL
  else
    as.list(coef(glm(y ~ x, family="binomial")))
}

DT3 <- DT[, fn(x, y), by="year,country"]

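Obviously, you can tailor the error checking in fn to your specific purpose.

Update: If you would like fn to handle multiple columns of the data table, you can use the following solution: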
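As one illustration of tailoring the check, here is a minimal sketch that also skips groups with too few complete rows or a constant response, and falls back to NULL if glm still raises an error. The helper name safe_logit and the min_obs threshold are hypothetical and not part of the original question; the sample data is regenerated with set.seed so the example is self-contained.

```r
library(data.table)

# Hypothetical helper: fit a logit per group, skipping groups that cannot be fit.
# min_obs is an illustrative threshold, not something from the original question.
safe_logit <- function(x, y, min_obs = 5L) {
  ok <- complete.cases(x, y)
  # Skip groups with too few complete rows or with a constant response.
  if (sum(ok) < min_obs || length(unique(y[ok])) < 2L) return(NULL)
  tryCatch(
    as.list(coef(glm(y ~ x, family = "binomial"))),
    error = function(e) NULL  # any remaining fitting error drops the group
  )
}

set.seed(1)
DT <- data.table(year    = rep(2001:2010, each = 100),
                 country = rep(rep(1:10, each = 10), 10),
                 firm    = rep(1:100, 10),
                 y       = round(runif(1000)),
                 x       = runif(1000))
DT[year == 2001 & country == 1, y := NA_real_]

# Groups for which safe_logit() returns NULL contribute no rows to the result.
res <- DT[, safe_logit(x, y), by = .(year, country)]
```

With only 10 observations per group, glm may still emit warnings (for example about fitted probabilities of 0 or 1); those are warnings rather than errors, so the affected groups stay in the result.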

fn <- function(df) {
  if (length(na.omit(df$y)) == 0)
    NULL
  else
    as.list(coef(glm(df$y ~ df$x, family="binomial")))
}

DT3 <- DT[, fn(.SD), by="year,country"]

Edit from Matthew:

This isn't how you should use data.table. There is no need to define a function. Just use the variables directly, like this:

DT3 <- DT[, 
  if (length(na.omit(y)) == 0)
    NULL
  else
    as.list(coef(glm(y ~ x, family="binomial")))
, by="year,country"]

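Repeating df$ inside fn() and calling fn(.SD) is not recommended with data.table unless you really do use all the columns of .SD. Use .SDcols to restrict which columns .SD contains. It is common to have a fairly large multi-line { ... } block as j.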
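The two points above (.SDcols and a multi-line j) can be combined as in the following sketch. It is only an illustration under assumed sample data: the extra column n and the use of na.omit(.SD) are choices made for this example, not part of the original answer.

```r
library(data.table)

set.seed(1)
DT <- data.table(year    = rep(2001:2010, each = 100),
                 country = rep(rep(1:10, each = 10), 10),
                 firm    = rep(1:100, 10),
                 y       = round(runif(1000)),
                 x       = runif(1000))
DT[year == 2001 & country == 1, y := NA_real_]

# .SDcols limits .SD to the model columns, so `firm` is left out of .SD.
res <- DT[, {
  d <- na.omit(.SD)                 # complete rows of y and x only
  if (nrow(d) == 0L) {
    NULL                            # empty group: contributes no rows
  } else {
    fit <- glm(y ~ x, family = "binomial", data = d)
    c(as.list(coef(fit)), list(n = nrow(d)))  # coefficients plus group size
  }
}, by = .(year, country), .SDcols = c("y", "x")]
```

Here res has one row per year-country group that had at least one complete observation, with the coefficient columns named (Intercept) and x.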

How to handle empty/incomplete subsets with data.table