I have a df containing three years of data.
library(data.table)
df <- data.table( YEAR = c("1999", "1999", "2000", "1999","2000",
"2000","1999", "2000", "2001", "2001", "2001", "2001"),
Sex=c("M", "F","F", "M","M", "F","F", "F", "M", "F","F", "M"),
V3 = c(1,2,3,4,5,6,7,8,9,10,11,12),
V4 = rnorm(12, mean = 0, sd = 1))
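The number of rows is the same for every year. I want to create 3 linear regression models, one per year. The training set size should be the same across years: for example, 3 instances per year for training and 1 instance for testing. I know there are many ways to do this for a single year, for example: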
library(dplyr)
df_1999 <- df %>%
  filter(YEAR == "1999")
samp <- sample(nrow(df_1999), 0.75 * nrow(df_1999))
train <- df_1999[samp, ]
test <- df_1999[-samp, ]
model_1999 <- lm(V4 ~ V3 + factor(Sex), data = train)
But I don't know how to fit all of the lm models at once.
Answer 0 (score: 1)
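This is a good example of a split-and-apply problem. I would use the split() function to divide the original data frame by year, then apply a function that fits a linear regression to each subset of the data.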
df <- data.frame( YEAR = c("1999", "1999", "2000", "1999","2000",
"2000","1999", "2000", "2001", "2001", "2001", "2001"),
Sex=c("M", "F","F", "M","M", "F","F", "F", "M", "F","F", "M"),
V3 = c(1,2,3,4,5,6,7,8,9,10,11,12),
V4 = rnorm(12, mean = 0, sd = 1))
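# Split the data frame into a list with one data frame per year
dfs <- split(df, df$YEAR)
set.seed(1)
# Fit one model per year; the result is a named list of lm objects
models <- lapply(dfs, function(d) {
  samp <- sample(nrow(d), 0.75 * nrow(d))  # 75% of the rows for training
  train <- d[samp, ]
  test <- d[-samp, ]  # held-out rows, not used in the fit
  lm(V4 ~ V3 + factor(Sex), data = train)
})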
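Note that because this sample dataset is so small, a training subset may not contain every level of a variable (for example, only one value of Sex), in which case lm() can fail with an error.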
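One way to keep a single failing year from aborting the whole run is to wrap the fit in tryCatch(). The following is only a sketch under that assumption, not part of the original answer: `fit_year` and its `train_frac` argument are hypothetical names introduced for illustration, and a year whose fit fails simply gets `NULL` as its model.

```r
set.seed(1)
df <- data.frame(
  YEAR = c("1999", "1999", "2000", "1999", "2000", "2000",
           "1999", "2000", "2001", "2001", "2001", "2001"),
  Sex  = c("M", "F", "F", "M", "M", "F", "F", "F", "M", "F", "F", "M"),
  V3   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  V4   = rnorm(12, mean = 0, sd = 1)
)

# fit_year is a hypothetical helper, not from the original answer.
# It splits one year's rows into train/test and tries to fit the model.
fit_year <- function(d, train_frac = 0.75) {
  samp  <- sample(nrow(d), floor(train_frac * nrow(d)))
  train <- d[samp, ]
  test  <- d[-samp, ]
  model <- tryCatch(
    lm(V4 ~ V3 + factor(Sex), data = train),
    error = function(e) NULL  # e.g. Sex has a single level in train
  )
  list(model = model, train = train, test = test)
}

# Named list with one element per year
fits <- lapply(split(df, df$YEAR), fit_year)

# Years whose model could not be fitted on this random draw
failed <- names(fits)[vapply(fits, function(f) is.null(f$model), logical(1))]
```

Each element of `fits` keeps its own train and test rows, so the held-out row for a year is available later as, for example, `fits[["1999"]]$test`.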