R loop does not iterate over each group of data

Time: 2018-07-27 13:50:11

Tags: r loops

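I am trying to run a loop that fits a best-subset model for each group. I have reached the point where I cannot get the loop to run on each group separately: it loops and writes multiple CSV files as expected, but the data in every file is the same: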

library(leaps)
library(dplyr)

#data
df = data.frame(matrix(rnorm(80), nrow=10))
df$state <- c('AL','AK','AR','AZ','CT')
state_list <- c('AL','AK','AR','AZ','CT')

for (state in state_list){
  data_filter <- subset(df, state = state)
  data_filter_u <- data_filter[c(1,2,3,4,5,6,7,8,9)]
  data_sub <- regsubsets(X8~., data_filter_u, nvmax = 8)
  data_summary <- summary(data_sub)
  data_coef <- coef(data_sub,which.max(data_summary$adjr2))
  as.data.frame(t(data_coef))
  data_coef$state_used <- state
  write.csv(data_coef,paste0(unique(state),".csv"))
}

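However, I get the same data in every file (same intercept, same variables used, same coefficients), and it creates three unexpected columns: 'stateAR', 'stateAZ', and 'stateCT'.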

+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+
|   | X.Intercept. |     X2      |     X3      |     X4      |     X5      |     X7      |   stateAR    |   stateAZ    |   stateCT    | state_used |
+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+
| 1 |  1.027070119 | 0.593400469 | 0.852107976 | 0.219067212 | 0.447761824 | 0.213681166 | -3.421259006 | -2.250303456 | -0.558997077 | AL         |
+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+

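I am trying to get similar output, but with the loop going through the states one at a time and selecting the appropriate columns for each state based on the best fit: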

+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+
|   | X.Intercept. |     X2      |     X3      |     X4      |     X5      |     X7      | state_used |
+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+
| 1 |  1.027070119 | 0.593400469 | 0.852107976 | 0.219067212 | 0.447761824 | 0.213681166 | AL         |
+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+

Thanks for your help.

1 Answer:

Answer 0: (score: 1)

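If I understand correctly, I think your subset command does not do what you intend. With state = state, the value is passed as a named argument rather than as a logical condition, so no rows are filtered out and every iteration works on the full data. You could use something like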

df[df$state == state, ]

to subset the data.frame by the current group of the loop. If you have several cases to match, you could use something like

df[df$state %in% c("AL", "AK"), ]
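
To make the difference concrete, here is a minimal base R sketch. The toy data frame and the names all_rows, same_name, and filtered are made up for illustration. It also shows a second pitfall that I believe applies here: because the loop variable has the same name as the column, even subset(df, state == state) would compare the column with itself and keep every row.

```r
# Toy data (assumption: two rows per state, as in the question)
df <- data.frame(x = 1:10)
df$state <- rep(c('AL', 'AK', 'AR', 'AZ', 'CT'), times = 2)
state <- "AL"  # stands in for the loop variable

# `state = state` is a named argument, not a condition: no filtering happens
all_rows <- subset(df, state = state)

# Inside subset(), `state` refers to the column on both sides: always TRUE
same_name <- subset(df, state == state)

# Direct indexing compares the column with the outer variable `state`
filtered <- df[df$state == state, ]

nrow(all_rows)   # 10
nrow(same_name)  # 10
nrow(filtered)   # 2
```

Renaming the loop variable (for example to st) would also make subset(df, state == st) behave as intended.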

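As a side note on speed: I believe direct subsetting without base::subset is generally faster (anyone, please correct me if I am wrong). See the benchmark below for an example. If your data is really large, you might consider data.table, which is faster still. For your very small dataset, however, that makes no sense because of data.table's overhead.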

df = data.frame(matrix(rnorm(80), nrow=10))
df$state <- c('AL','AK','AR','AZ','CT')
state_list <- c('AL','AK','AR','AZ','CT')
microbenchmark::microbenchmark(
  (a = subset(df, state == "AL")),
  (b = df[df$state == "AL", ])
)
# Unit: microseconds
#                            expr     min       lq      mean   median       uq     max neval cld
# (a = subset(df, state == "AL")) 118.031 121.1885 128.32595 123.1625 125.9260 273.167   100  b
# (b = df[df$state == "AL", ])     92.372  95.9250  99.84874  97.1090  99.4775 215.139   100  a 
all.equal(a,b)
# [1] TRUE
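
Putting this together, a corrected version of the loop from the question might look like the sketch below. It is only a sketch under some assumptions of mine: I use 20 rows per state (the original two rows per state are too few to fit a model with seven predictors), a fixed seed, a loop variable named st to avoid clashing with the state column, and a results list to collect the output. I also drop the state column before fitting, since leaving it among the predictors is what seems to produce the stateAR, stateAZ, and stateCT columns.

```r
library(leaps)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
state_list <- c('AL', 'AK', 'AR', 'AZ', 'CT')

# Assumption: 20 rows per state so each per-state model has enough observations
df <- data.frame(matrix(rnorm(8 * 100), nrow = 100))
df$state <- rep(state_list, each = 20)

results <- list()
for (st in state_list) {
  # Keep only the current state's rows and drop the state column,
  # so it cannot enter the model as a predictor
  data_filter <- df[df$state == st, names(df) != "state"]

  data_sub <- regsubsets(X8 ~ ., data = data_filter, nvmax = 7)
  data_summary <- summary(data_sub)
  best <- which.max(data_summary$adjr2)

  # coef() returns a named numeric vector; turn it into a one-row data frame
  data_coef <- as.data.frame(t(coef(data_sub, best)))
  data_coef$state_used <- st
  results[[st]] <- data_coef

  # write.csv(data_coef, paste0(st, ".csv"))  # uncomment to write one file per state
}
```

Note that the result of as.data.frame() is assigned back to data_coef here; in the question it was computed but discarded.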