Question

我有一个函数拟合分布并返回一个由分布名称，均值，sd等组成的向量。我正在测试一些发行版，但我不能依赖gofstat（），因为当有太多的零需要考虑时，它会发疯。

因此，我必须手动比较几个变量的AIC，确定哪些变量实际上是“fitdist”类，并返回具有最低AIC的变量的名称。一旦我有了，我计算平均值，sd等并返回。

目前代码如下：

library(fitdistrplus)

fit_distr <- function(data){

fe <- tryCatch(fitdist(data, "exp"), error = function(e) FALSE )
flogis <- tryCatch(fitdist(data, "logis"), error = function(e) FALSE )
fn <- tryCatch(fitdist(data, "norm"), error = function(e) FALSE)
fp <- tryCatch(fitdist(data, "pois"), error = function(e) FALSE)
fg <- tryCatch(fitdist(data, "gamma"), error = function(e) FALSE)

classFitDist <- c(class(fe), class(flogis), class(fn), class(fp),class(fg))

distributions <- classFitDist == "fitdist"


AIC <- data.frame()
if(class(fe)=="fitdist") {AIC[1,ncol(AIC)+1] <- fe$aic}
if(class(flogis)=="fitdist") {AIC[1,ncol(AIC)+1] <- flogis$aic}
if(class(fn)=="fitdist") {AIC[1,ncol(AIC)+1] <- fn$aic}
if(class(fp)=="fitdist") {AIC[1,ncol(AIC)+1] <- fp$aic}
if(class(fg)=="fitdist") {AIC[1,ncol(AIC)+1] <- fg$aic}

names(AIC) <- c("exp", "logis", "norm", "pois", "gamma")[distributions]

fit <- names(AIC[which.min(AIC)])


mean <- switch (fit,
                          exp = 1/fe$estimate[[1]],
                          logis = flogis$estimate[[1]],
                          norm = fn$estimate[[1]],
                          pois = fp$estimate[[1]],
                          gamma = fg$estimate[[1]]/fg$estimate[[2]]
          )

sd <- switch (fit,
                        exp = mean,
                        logis = (flogis$estimate[[2]]*pi)/sqrt(3),
                        norm = fn$estimate[[2]],
                        pois = sqrt(mean),
                        gamma = sqrt(fg$estimate[[1]]/(fg$estimate[[2]]^2))
          )

return(c(fit,mean,sd))
}

它有效，但要考虑的数千个样本非常慢。我欢迎任何建议如何优化它并使其“更清洁”和更快。

顺便说一下，这就是我之前所拥有的，但是就像我提到的那样 - 样本由太多的零组成，它有一个合适的（双关语无意间！）

goodnessoffit <- gofstat(list(fe, flogis, fn, fp, fg)[distributions],  fitnames = c("exp", "logis", "norm", "pois","gamma")[distributions])
fit <- names(which(goodnessoffit$aic == min(goodnessoffit$aic)))

ans [！test＆amp; amp; ok]＆lt; - rep（no，length.out = length（ans））[！test＆amp; ：替换的长度为零

Answer 1

这种方法的问题是fitdist效率低下。您需要通过编写更好的算法来找到更快的方法来查找AIC。一种方法是拟合glm。

AIC.fitdist <- function(x, ...) x$aic

x <- rnorm(100, mean=20)

AIC(fitdist(x, 'norm'))
AIC(glm(x ~ 1 , family=gaussian)) ## same

AIC(fitdist(x, 'gamma'))
AIC(glm(x ~ 1 , family=Gamma)) ## same

某些分析显示fitdist与glm具有相同的计算时间。这对fitdist来说是个坏消息，因为glm只是glm.fit的膨胀包装。使用glm.fit可以节省大量时间。最后，如果您真的必须削减时间（数百万而不是数千）的模型，您可以使用一步估算器

> benchmark(
+     fitdist(x, 'gamma'),
+     glm(x ~ 1, family=Gamma),
+     glm.fit(rep(1, length(x)), x, family=Gamma()),
+     glm.fit(rep(1, length(x)), x, family=Gamma(), control = glm.control(maxit=1))
+ )
                                                                               test replications elapsed relative user.self sys.self user.child
1                                                               fitdist(x, "gamma")          100    0.42    7.000      0.42        0         NA
2                                                        glm(x ~ 1, family = Gamma)          100    0.17    2.833      0.17        0         NA
3                                   glm.fit(rep(1, length(x)), x, family = Gamma())          100    0.06    1.000      0.07        0         NA
4 glm.fit(rep(1, length(x)), x, family = Gamma(), control = glm.control(maxit = 1))          100    0.06    1.000      0.06        0         NA
  sys.child
1        NA
2        NA
3        NA
4        NA

aic是glm.fit输出中的存储对象。

可以使用生存包中的survreg进行指数分布拟合：survreg(rep(1,100), x, dist='exponential)。

最后，由于这些都是常规的指数族，因此您可以使用足够的统计数据来得出概率分布。例如：

normaic <- function(x) {
  4 - 2*sum(dnorm(x, mean(x), sd(x), log=T))
}

> benchmark(normaic(x), glm.fit(rep(1, 100), x)$aic)
                         test replications elapsed relative user.self sys.self user.child sys.child
2 glm.fit(rep(1, 100), x)$aic          100    0.04       NA      0.05        0         NA        NA
1                  normaic(x)          100    0.00       NA      0.00        0         NA        NA

快速查找具有最低组件值的变量名称

1 个答案: