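Thanks to a number of excellent Stack Overflow posts, I have a working solution for filling in the missing rows of my time-series data. My main concern is whether there is a way to make it shorter and more concise. I'm working with data like this: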
df <- data.frame(
id = c("A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
week = c(-13, -2, 4, 5, 6, 3, 4, 5, -8, -5, 3),
last_week = c(6, 6, 6, 6, 6, 5, 5, 5, 3, 3, 3),
first_week = c(-20, -20, -20, -20, -20, 2, 2, 2, -3, -3, -3),
dv = c(3, 2, 2, 1, 4, 5, 2, 3, 1, 1, 2)
)
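I have three goals:
1) If first_week is less than -10, there should be a row for every week from -10 to last_week. That is, ID A should have one row per week from -10 to 6.
2) If first_week is greater than 0, there should be a row for every week from 1 to last_week. That is, ID B should have one row per week from 1 to 5.
3) In all other cases, there should be a row for every week from first_week to last_week. That is, ID C should have one row per week from -3 to 3.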
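To make the three rules concrete, here is a minimal base-R sketch that computes the week range each ID should end up with. The helper name `start_week` is made up for illustration and is not part of my actual code; the first_week and last_week values are copied from the example data above.

```r
# Hypothetical helper (an assumption for illustration): the first week
# that should appear for an id, given its first_week value.
start_week <- function(first_week) {
  ifelse(first_week < -10, -10,            # rule 1: cap the start at -10
         ifelse(first_week > 0, 1,         # rule 2: start at week 1
                first_week))               # rule 3: keep first_week as is
}

# first_week / last_week per id, taken from the example data
first_week <- c(A = -20, B = 2, C = -3)
last_week  <- c(A = 6,   B = 5, C = 3)

# One vector of expected weeks per id
expected_weeks <- Map(seq, start_week(first_week), last_week)

# Number of rows each id should have after filling
lengths(expected_weeks)
```

With the example data this should give 17 weeks for A (-10 to 6), 5 for B (1 to 5) and 7 for C (-3 to 3).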
Right now, my solution looks like this:
library(dplyr)  # provides filter(), mutate() and %>%

loop_for_filling <- function(df) {
  for (i in unique(df$id)) {
    current_id_df <- filter(df, id == i)
    current_id_last_week <- unique(current_id_df$last_week)
    current_id_first_week <- unique(current_id_df$first_week)
    # Create a sequence of weeks to be filled
    if (current_id_first_week > 0) {
      all_weeks <- seq(1, current_id_last_week)
    } else if (current_id_first_week < -10) {
      all_weeks <- seq(-10, current_id_last_week)
    } else {
      all_weeks <- seq(current_id_first_week, current_id_last_week)
      # Drop observed rows that fall before first_week
      current_id_df <- filter(current_id_df, week >= first_week)
    }
    # Create a data frame with a row for every week between first_week and last_week
    current_id_all <- data.frame(list(week = all_weeks)) %>% mutate(id = i)
    # Merge the two data frames (full outer join on id and week)
    current_id_new_df <- merge(current_id_df, current_id_all, all = TRUE) %>%
      subset(., select = -c(last_week, first_week)) %>%
      filter(week >= -10)
    # Bind the per-id data frames together
    if (i == unique(df$id)[[1]]) { all_file <- current_id_new_df }
    if (i != unique(df$id)[[1]]) { all_file <- rbind(all_file, current_id_new_df) }
  }
  all_file
}
df2 <- loop_for_filling(df)
df2
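This works, but I'm dealing with a large dataset (50,000 IDs), and I'm wondering whether there is a shorter, more concise way to handle this so that I don't have to stare at my loop for three hours :)
Thanks!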
Answer 0 (score: 1)
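I think this will run faster. First, I apply the specified adjustments to determine the range of weeks that should appear for each id. Then I use tidyr::uncount() to create a row for every required id-week combination. Finally, I join in the original data.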
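library(tidyverse)

# One row per id, with first_week adjusted according to the three rules
df_ranges <- df %>%
  distinct(id, first_week, last_week) %>%
  mutate(first_week = case_when(first_week < -10 ~ -10,
                                first_week > 0   ~ 1,
                                TRUE             ~ first_week)) %>%
  mutate(week_count = last_week - first_week + 1)

# Repeat each id once per week, turn the running index into a week number,
# then attach the observed dv values (NA where a week was missing)
df2b <- df_ranges %>%
  uncount(week_count, .id = "week") %>%
  mutate(week = first_week + week - 1) %>%
  select(id, week) %>%
  left_join(df %>% select(id, week, dv), by = c("id", "week"))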
identical(df2b, df2)
#[1] TRUE
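In case the uncount() step is unclear, here is a small standalone sketch of what it does. The data frame `ranges` and the column name `k` are made up for illustration; the values correspond to ids B and C after the first_week adjustment.

```r
library(tidyr)

# Illustrative input (an assumption, not the real data): one row per id
# with its adjusted first week and the number of weeks to generate.
ranges <- data.frame(
  id         = c("B", "C"),
  first_week = c(1, -3),
  week_count = c(5, 7)
)

# uncount() repeats each row week_count times and, by default, drops the
# week_count column. With .id = "k" it also adds a running index
# 1, 2, ..., week_count within each original row.
expanded <- uncount(ranges, week_count, .id = "k")

# Shift the running index so it starts at first_week
expanded$week <- expanded$first_week + expanded$k - 1

expanded[, c("id", "week")]
```

This should yield weeks 1 to 5 for B and -3 to 3 for C, which is the id-week grid that the original data then gets joined onto.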