我已经确定,如果不是我自己创建的,几周前在StackOverflow上从一个慷慨的响应者那里得到的一些好的代码就可以解决这个问题了,今天我可以使用一些新的帮助。
示例数据(下面称为对象eh
):
ID 2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
5167f 0 0 0 0 0 0 0 0 0
1214m 0 0 0 0 0 0 0 0 0
1844f 0 0 0 0 0 0 0 0 0
2113m 0 0 0 0 0 0 0 0 0
2254m 0 0 0 0 0 0 0 0 0
2721f 0 0 0 0 0 0 0 0 0
3121f 0 0 0 0 0 0 0 0 0
3486f 0 0 0 0 0 0 0 0 0
3540f 0 0 0 0 0 0 0 0 0
4175m 0 0 0 0 0 0 0 0 0
我需要能够将0s
和1s
分组到各自列日期的时间段(例如,每1,2,3或4周)。每当1
在特定日期范围(Period
)内至少失一次,那么1
中的ID
汇总Period
(0
1}},否则)。
我从1周的汇总例程开始作为例子。我的主要问题是,在时间序列Periods
到"2013-03-20"
期间,生成的最终输出缺少一些可能的1周"2015-12-31"
。
请注意,在此示例输出中,其中行用于唯一IDs
,列用于唯一Periods
,Periods
2,5,7和9如何丢失:
1 3 4 6 8 10 11 12 13 14
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
以下是对原始数据框进行分组的完整例程(参见上面共享的示例数据):
#Convert to data table from original data frame, eh
dt <- as.data.table(eh)
#One week summarized encounter histories
dt_merge <- data_frame(
# Create a column showing the beginning date
Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
# Create a column showing the end date of each period
mutate(Date2 = lead(Date1)) %>%
# Adjust Date1
mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
# Remove the last row
drop_na(Date2) %>%
# Create date list
mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
unnest() %>%
# Create Group ID
mutate(RunID = group_indices_(., dots. = c("Date1", "Date2"))) %>%
# Create Period ID
mutate(Period = paste0(RunID)) %>%
# Add a column showing Month
mutate(Month = month(Dates)) %>%
# Add a column showing Year
mutate(Year = year(Dates)) %>%
# Add a column showing season
mutate(Season = case_when(
Month %in% 3:5 ~ "Spring",
Month %in% 6:8 ~ "Summer",
Month %in% 9:11 ~ "Fall",
Month %in% c(12, 1, 2) ~ "Winter",
TRUE ~ NA_character_
)) %>%
# Combine Season and Year
mutate(SeasonYear = paste0(Season, Year)) %>%
select(-Date1, -Date2, -RunID)
dt2 <- dt %>%
# Reshape the data frame
gather(Date, Value, -ID) %>%
# Convert Date to date class
mutate(Date = ymd(Date)) %>%
# Join dt_merge
left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
group_by(ID, Period) %>%
summarise(Value = max(Value)) %>%
spread(Period, Value)
#Finished product
one.week <- as.data.frame(one.week)
#Missing weeks 2, 5, 7, and 9...
one.week
有人可以帮我理解我哪里出错吗?提前谢谢!
-AD
答案 0 :(得分:2)
这种情况正在发生,因为eh
数据中缺少这几周。例如,如果您查看构成第2周的日期:
dt_merge %>%
filter(Period == 2)
#> # A tibble: 7 x 6
#> Dates Period Month Year Season SeasonYear
#> <date> <chr> <dbl> <dbl> <chr> <chr>
#> 1 2013-03-28 2 3 2013 Spring Spring2013
#> 2 2013-03-29 2 3 2013 Spring Spring2013
#> 3 2013-03-30 2 3 2013 Spring Spring2013
#> 4 2013-03-31 2 3 2013 Spring Spring2013
#> 5 2013-04-01 2 4 2013 Spring Spring2013
#> 6 2013-04-02 2 4 2013 Spring Spring2013
#> 7 2013-04-03 2 4 2013 Spring Spring2013
您可以看到这些日期都不在eh
的列中,从2013-03-20跳到2013-04-09。由于您在创建left_join
时使用dt2
,因此只保留eh
中的日期(以及周数)。
可以使用 tidyr 包中的complete()
来更正此问题,以创建ID和日期的缺失组合。
dt2 <- dt %>%
# Reshape the data frame
gather(Date, Value, -ID) %>%
# Convert Date to date class
mutate(Date = ymd(Date)) %>%
# Create missing ID/Date combinations
complete(ID, Date = dt_merge$Dates) %>%
# Join dt_merge
left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
mutate(Period = as.numeric(Period)) %>%
group_by(ID, Period) %>%
summarise(Value = max(Value, na.rm = TRUE)) %>%
spread(Period, Value)
one.week
#> # A tibble: 10 x 146
#> # Groups: ID [10]
#> ID `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
#> * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1214m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 2 1844f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 3 2113m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 4 2254m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 5 2721f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 6 3121f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 7 3486f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 8 3540f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 9 4175m 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> 10 5167f 0 -Inf 0 0 -Inf 0 -Inf 0 -Inf 0 -Inf
#> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
#> # `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
#> # `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
#> # `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#> # `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
#> # `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
#> # `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
#> # `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
#> # `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
#> # `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
#> # `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
#> # `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
#> # `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
#> # `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
#> # `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
#> # `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
#> # `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
#> # `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
#> # `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
#> # `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
#> # `110` <dbl>, `111` <dbl>, ...
如果给定周内没有该ID的值,则返回-Inf
。或者,不是使用NA
填充缺失值,而是使用complete(ID, Date = dt_merge$Dates, fill = list(Value = 0))
填充,例如0。这将使任何未观察到的ID和日期组合的值变量为0.