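I'm trying to run a loop that fits a best-subset model for each group. I've reached the point where I can't get the loop to treat each group separately. It loops and writes multiple CSV files as expected, but the data in every file is the same: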
library(leaps)
library(dplyr)
#data
df = data.frame(matrix(rnorm(80), nrow=10))
df$state <- c('AL','AK','AR','AZ','CT')
state_list <- c('AL','AK','AR','AZ','CT')
for (state in state_list){
data_filter <- subset(df, state = state)
data_filter_u <- data_filter[c(1,2,3,4,5,6,7,8,9)]
data_sub <- regsubsets(X8~., data_filter_u, nvmax = 8)
data_summary <- summary(data_sub)
data_coef <- coef(data_sub,which.max(data_summary$adjr2))
as.data.frame(t(data_coef))
data_coef$state_used <- state
write.csv(data_coef,paste0(unique(state),".csv"))
}
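However, I get the same data in every file (same intercept, same selected variables, same coefficients), and it creates three unexpected columns: 'stateAR', 'stateAZ' and 'stateCT'.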
+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+
| | X.Intercept. | X2 | X3 | X4 | X5 | X7 | stateAR | stateAZ | stateCT | state_used |
+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+
| 1 | 1.027070119 | 0.593400469 | 0.852107976 | 0.219067212 | 0.447761824 | 0.213681166 | -3.421259006 | -2.250303456 | -0.558997077 | AL |
+---+--------------+-------------+-------------+-------------+-------------+-------------+--------------+--------------+--------------+------------+
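I'm trying to get similar output, but with the loop going through the states one at a time and selecting the appropriate columns based on the best fit for each state: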
+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+
| | X.Intercept. | X2 | X3 | X4 | X5 | X7 | state_used |
+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+
| 1 | 1.027070119 | 0.593400469 | 0.852107976 | 0.219067212 | 0.447761824 | 0.213681166 | AL |
+---+--------------+-------------+-------------+-------------+-------------+-------------+------------+
Thanks for your help.
Answer 0 (score: 1)
If I understand correctly, I don't think your subset command does what you intend it to do. You could use something like
df[df$state == state, ]
to subset the data.frame by the current group of the loop. If you have several cases to match, you can use something like
df[df$state %in% c("AL", "AK"), ]
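To make the difference concrete, here is a minimal base R sketch of what appears to go wrong in the question. As far as I can tell, `subset(df, state = state)` passes `state` as a named argument that matches none of the formals of `subset.data.frame`, so it falls into `...` and no row filter is applied. Rewriting it as `state == state` does not help either, because inside `subset` both sides refer to the column. The names `all_rows`, `same_name` and `by_index` are only illustrative.

```r
# Toy data shaped like the question's: 10 rows, 5 states, 2 rows per state.
set.seed(1)
df <- data.frame(matrix(rnorm(80), nrow = 10))
df$state <- rep(c("AL", "AK", "AR", "AZ", "CT"), times = 2)

state <- "AL"  # same name as the column, as in the question's for loop

# `state = state` is a named argument, not a condition: nothing is filtered.
all_rows <- subset(df, state = state)
nrow(all_rows)   # 10

# Inside subset(), `state` on both sides is the column, so every row matches.
same_name <- subset(df, state == state)
nrow(same_name)  # 10

# With bracket indexing, df$state is the column and `state` is the loop variable.
by_index <- df[df$state == state, ]
nrow(by_index)   # 2
```

If you prefer to keep `subset`, giving the loop variable a name that differs from the column (for example `st`) and writing `subset(df, state == st)` should avoid the clash.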
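A side note on speed: I think direct subsetting without base::subset is usually faster (anyone, please correct me if I'm wrong). See the benchmark below as an example. If your data is really large, you might consider using data.table, which is faster still. However, because of data.table's overhead, it makes no sense for your very small data set.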
df = data.frame(matrix(rnorm(80), nrow=10))
df$state <- c('AL','AK','AR','AZ','CT')
state_list <- c('AL','AK','AR','AZ','CT')
microbenchmark::microbenchmark(
(a = subset(df,state == "AL"))
,(b =df[df$state == "AL", ])
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# (a = subset(df, state == "AL")) 118.031 121.1885 128.32595 123.1625 125.9260 273.167 100 b
# (b = df[df$state == "AL", ]) 92.372 95.9250 99.84874 97.1090 99.4775 215.139 100 a
all.equal(a,b)
# [1] TRUE
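Returning to the loop in the question, a rough sketch of how the bracket-style filter might be plugged in is shown below. It rests on a few assumptions of mine that are not in the original post: the toy data is enlarged to 20 rows per state so that each per-state fit has more rows than predictors, the loop variable is renamed to `st`, the files go to a temporary directory (`out_dir`), and `nvmax` is set to 7 because there are seven candidate predictors. The `state` column is also dropped before fitting. In the question it is column 9 of the selected columns, which is probably where the 'stateAR', 'stateAZ' and 'stateCT' dummy columns come from.

```r
library(leaps)

set.seed(42)
state_list <- c("AL", "AK", "AR", "AZ", "CT")

# Assumption: 20 rows per state instead of 2, so regsubsets has enough data.
df <- data.frame(matrix(rnorm(800), nrow = 100))  # columns X1..X8
df$state <- rep(state_list, times = 20)

out_dir <- tempdir()  # hypothetical output location

for (st in state_list) {
  # Keep only this state's rows and only the numeric columns X1..X8.
  data_filter <- df[df$state == st, paste0("X", 1:8)]

  # Best-subset search for X8 on the remaining seven predictors.
  data_sub <- regsubsets(X8 ~ ., data = data_filter, nvmax = 7)
  data_summary <- summary(data_sub)

  # Coefficients of the model size with the highest adjusted R^2.
  best <- which.max(data_summary$adjr2)
  data_coef <- as.data.frame(t(coef(data_sub, best)))
  data_coef$state_used <- st

  write.csv(data_coef, file.path(out_dir, paste0(st, ".csv")), row.names = FALSE)
}
```

Note that the result of `as.data.frame(t(...))` is assigned back to `data_coef` here. In the question that line's result is discarded, so `state_used` ends up being added to a plain named vector.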