Summarising columns based on conditions

Time: 2020-10-10 12:49:33

Tags: r tidyverse

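I have a data frame with three columns: x, ID and date_time. The x column holds the recordings of variable x, ID indicates which subject was recorded, and date_time indicates when. The data frame is shown below.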

From this data frame I want to compute a new data frame with seven columns: "Measurement", "ID", "Date", "x_4_10_day", "Day_total", "x_4_10_night" and "Night_total".

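  1. "Measurement". This column should give the sequential number of the measurement for a given ID. A measurement runs from 23:00:00 until 22:59:59 the next day. However, measurements are started at random times, so the first measurement does not last 24 hours. Neither does the last one.
  2. "ID". The ID that the given measurement belongs to.
  3. "Date". This column should show, in yyyy.mm.dd format, the date of the last recording in the given measurement.
  4. "x_4_10_day": A measurement is split into a day part (7:00:00-22:59:59) and a night part (23:00:00-6:59:59). This column should give the total time (in minutes) during the day part of the given measurement in which x is between 4 and 10 (both inclusive). A recording with x between 4 and 10 can be treated as x being in that range for 5 minutes, because recordings are 5 minutes apart.
  5. "Day_total": This column should give the total time (in minutes) for which x was measured during the day. Missing values of x should be subtracted. Missing values of x are left blank. For each missing recording, 5 minutes should be subtracted from the total time. Also, some measurements start later than 7:00.
  6. "x_4_10_night": This column should give the total time (in minutes) during the night part of the given measurement in which x is between 4 and 10 (both inclusive).
  7. "Night_total": This column should give the total time (in minutes) for which x was measured during the night. Missing values of x should be subtracted. Missing values of x are left blank. For each missing recording, 5 minutes should be subtracted from the total time.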
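As a rough illustration of rules 1, 3 and 4, the sketch below labels each timestamp as day or night and assigns it to a measurement date. It uses base R only. The helper name `label_records` is hypothetical and not part of the original post, and it assumes that the date of a measurement is the calendar date on which its 23:00-22:59:59 window ends.

```r
# Hypothetical helper (not from the original post): label each timestamp as
# day/night and assign it to the date on which its measurement window ends.
label_records <- function(date_time) {
  ts <- as.POSIXct(date_time, format = "%Y.%m.%d %H:%M:%S", tz = "UTC")
  hr <- as.integer(format(ts, "%H"))
  data.frame(
    date_time = date_time,
    # day = 07:00:00-22:59:59, night = 23:00:00-06:59:59
    period = ifelse(hr >= 7 & hr < 23, "day", "night"),
    # shifting by one hour moves 23:00-23:59 onto the next calendar date
    date = format(ts + 3600, "%Y.%m.%d"),
    stringsAsFactors = FALSE
  )
}

res <- label_records(c("2020.03.02 22:55:17",
                       "2020.03.02 23:00:17",
                       "2020.03.03 01:25:17"))
res
```

With these three timestamps, the first is labelled "day" with date 2020.03.02, and the other two are labelled "night" with date 2020.03.03.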

There should be one row per unique measurement. So far I have code that correctly returns the columns "Measurement", "ID" and "Date":

df1$mydate = as.Date(df1$date_time, format = "%Y.%m.%d %H:%M:%S")
df1$tm <- as.numeric(df1$date_time)
df1$dts <- 86400*as.numeric(df1$mydate)
df2 <- df1 %>% 
group_by(ID,mydate) %>% 
transform(date = case_when(((dts-3600)<tm & tm<(dts+82800)) ~paste0(mydate), ((dts+82800)<=tm) ~paste0(mydate+1) )) %>% 
select(ID,date) %>%   
unique() %>% 
group_by(ID) %>% 
mutate(measurement = row_number())

But I don't know how to compute the remaining columns.

Here is the expected output:

dummy_output <- read.table(header=TRUE, text ="
                     ID Date        Measurement x_4_10_day Day_total x_4_10_night Night_total
                     12 2020.03.02  1           30         40        0            0
                     12 2020.03.03  2           0          0         45           75
                     13 2020.05.09  1           90         90        0            0
") 

Any suggestions are much appreciated, thank you!

Here is the data:

structure(list(date_time = c("2020.03.02 22:00:17", "2020.03.02 22:05:17", 
"2020.03.02 22:10:17", "2020.03.02 22:35:17", "2020.03.02 22:40:17", 
"2020.03.02 22:45:17", "2020.03.02 22:50:17", "2020.03.02 22:55:17", 
"2020.03.02 23:00:17", "2020.03.02 23:05:17", "2020.03.02 23:10:17", 
"2020.03.02 23:15:17", "2020.03.02 23:20:17", "2020.03.02 23:25:17", 
"2020.03.02 23:30:17", "2020.03.02 23:35:17", "2020.03.02 23:40:17", 
"2020.03.02 23:45:17", "2020.03.02 23:50:17", "2020.03.02 23:55:17", 
"2020.03.03 00:00:17", "2020.03.03 00:55:17", "2020.03.03 01:00:17", 
"2020.03.03 01:05:17", "2020.03.03 01:10:17", "2020.03.03 01:15:17", 
"2020.03.03 01:20:17", "2020.03.03 01:25:17", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
"2020.05.09 08:39:32", "2020.05.09 08:39:32"), id = c(12L, 12L, 
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L, 13L, 13L), x = c("7.55", "4.55", "4.55", "12", 
"12", "10", "10", "4.3", "", "", "4.3", "4.3", "4.3", "", "4.3", 
"12", "12", "12", "2", "12", "12", "", "8", "3", "3", "2", "2", 
"", "12", "10", "10", "4.3", "4.3", "4.3", "4.3", "4.3", "4.3", 
"4.3", "4.3", "12", "12", "12", "12", "12", "12", "12")), row.names = c(NA, 
46L), class = "data.frame")

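2 answers:

Answer 0 (score: 1)

I have added id=14, which contains only night values, to your data frame. Perhaps this is what you are after. Note that your expected values do not fully match your stated requirements.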

df11 <- structure(list(date_time = c("2020.03.02 22:00:17", "2020.03.02 22:05:17", 
                             "2020.03.02 22:10:17", "2020.03.02 22:35:17", "2020.03.02 22:40:17", 
                             "2020.03.02 22:45:17", "2020.03.02 22:50:17", "2020.03.02 22:55:17", 
                             "2020.03.02 23:00:17", "2020.03.02 23:05:17", "2020.03.02 23:10:17", 
                             "2020.03.02 23:15:17", "2020.03.02 23:20:17", "2020.03.02 23:25:17", 
                             "2020.03.02 23:30:17", "2020.03.02 23:35:17", "2020.03.02 23:40:17", 
                             "2020.03.02 23:45:17", "2020.03.02 23:50:17", "2020.03.02 23:55:17", 
                             "2020.03.03 00:00:17", "2020.03.03 00:55:17", "2020.03.03 01:00:17", 
                             "2020.03.03 01:05:17", "2020.03.03 01:10:17", "2020.03.03 01:15:17", 
                             "2020.03.03 01:20:17", "2020.03.03 01:25:17", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.05.09 08:39:32", "2020.05.09 08:39:32", 
                             "2020.03.02 23:45:17", "2020.03.02 23:50:17", "2020.03.02 23:55:17", 
                             "2020.03.03 00:00:17", "2020.03.03 00:55:17", "2020.03.03 01:00:17" 
                             ), 
                      x = c("7.55", "4.55", "4.55", "12", 
                            "12", "10", "10", "4.3", "", "", "4.3", "4.3", "4.3", "", "4.3", 
                            "12", "12", "12", "2", "12", "12", "", "8", "3", "3", "2", "2", 
                            "", "12", "10", "10", "4.3", "4.3", "4.3", "4.3", "4.3", "4.3", 
                            "4.3", "4.3", "12", "12", "12", "12", "12", "12", "12",
                            "12", "10", "10", "4.3", "4.3", "4.3"),
               id = c(12L, 12L, 
                      12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
                      12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
                      13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 
                      13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L)), 
               row.names = c(NA, 52L), class = "data.frame")

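library(dplyr)      # %>%, group_by, summarise, case_when, mutate_if, select
library(tidyr)      # pivot_wider, replace_na
library(lubridate)  # as_datetime

df11$xn <- as.numeric(df11$x)   # blank strings become NA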
df1 <- df11 %>% transform(xmin = ifelse((xn<4 | xn>10 | is.na(xn)),0,5 ),
                          xmint = ifelse(is.na(xn),-5,5 ))
df1$dateTime = as_datetime(df1$date_time, format = "%Y.%m.%d %H:%M:%S")
df1$mydate = as.Date(df1$date_time, format = "%Y.%m.%d %H:%M:%S")

df1$tm <- as.numeric(df1$dateTime)
df1$dts <- 86400*as.numeric(df1$mydate)

df2 <- df1 %>% group_by(id,mydate) %>% 
         transform(date = case_when(((dts-3600)<tm & tm<(dts+82800) )~paste0(mydate),((dts+82800)<=tm)~paste0(mydate+1) )) %>%
         transform(dayrnight = ifelse((tm>=(dts+25200) & tm<(dts+82800) ),'day','night' ) ) %>% 
         group_by(id,date,dayrnight) %>% 
         dplyr::summarise(x_4_10 = sum(xmin), total = sum(xmint)) %>% 
         pivot_wider(id_cols = c(id,date), names_from = dayrnight, values_from = c("x_4_10", "total")) %>% 
         mutate_if(is.numeric , replace_na, replace = 0) %>% 
         group_by(id) %>% mutate(measurement = row_number()) %>% 
         select(id,date,measurement,x_4_10_day,total_day,x_4_10_night,total_night)

> df2
# A tibble: 4 x 7
# Groups:   id [3]
     id date       measurement x_4_10_day total_day x_4_10_night total_night
  <int> <chr>            <int>      <dbl>     <dbl>        <dbl>       <dbl>
1    12 2020-03-02           1         30        40            0           0
2    12 2020-03-03           2          0         0           25          50
3    13 2020-05-09           1         50        90            0           0
4    14 2020-03-03           1          0         0           25          30
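One way to see where the night totals for id 12 differ from the expected output in the question is to redo the arithmetic by hand. The sketch below uses base R and the 20 night recordings of id 12 copied from the posted data. The variable names are illustrative only. `total_penalty` mirrors the `xmint` scoring above (a missing value counts as -5), while `total_skip` is an alternative reading in which a missing value simply contributes 0.

```r
# x values of id 12 recorded from 2020.03.02 23:00:17 to 2020.03.03 01:25:17
x_night <- c("", "", "4.3", "4.3", "4.3", "", "4.3", "12", "12", "12",
             "2", "12", "12", "", "8", "3", "3", "2", "2", "")
xn <- as.numeric(x_night)   # blank strings become NA

# 5 minutes per recording with 4 <= x <= 10
in_range <- sum(!is.na(xn) & xn >= 4 & xn <= 10) * 5

# scoring used in the answer: +5 per recorded value, -5 per missing value
total_penalty <- sum(ifelse(is.na(xn), -5, 5))

# alternative reading (an assumption): missing values contribute 0 minutes
total_skip <- sum(!is.na(xn)) * 5

c(in_range = in_range, total_penalty = total_penalty, total_skip = total_skip)
```

This gives 25, 50 and 75. The first two match row 2 of the output above, and 75 matches the Night_total in the question's expected output, which suggests the question subtracts each missing recording only once from the full 100 minutes.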

Answer 1 (score: 1)

It took me a while, but maybe this is what you want.

Sample data (with slightly different date/times for id 13):

df <- structure(list(date_time = c("2020.03.02 22:00:17", "2020.03.02 22:05:17", 
                             "2020.03.02 22:10:17", "2020.03.02 22:35:17", "2020.03.02 22:40:17", 
                             "2020.03.02 22:45:17", "2020.03.02 22:50:17", "2020.03.02 22:55:17", 
                             "2020.03.02 23:00:17", "2020.03.02 23:05:17", "2020.03.02 23:10:17", 
                             "2020.03.02 23:15:17", "2020.03.02 23:20:17", "2020.03.02 23:25:17", 
                             "2020.03.02 23:30:17", "2020.03.02 23:35:17", "2020.03.02 23:40:17", 
                             "2020.03.02 23:45:17", "2020.03.02 23:50:17", "2020.03.02 23:55:17", 
                             "2020.03.03 00:00:17", "2020.03.03 00:55:17", "2020.03.03 01:00:17", 
                             "2020.03.03 01:05:17", "2020.03.03 01:10:17", "2020.03.03 01:15:17", 
                             "2020.03.03 01:20:17", "2020.03.03 01:25:17", "2020.05.09 08:39:32", 
                             "2020.05.09 08:44:32", "2020.05.09 08:49:32", "2020.05.09 08:54:32", 
                             "2020.05.09 08:59:32", "2020.05.09 09:39:32", "2020.05.09 09:44:32", 
                             "2020.05.09 09:49:32", "2020.05.09 09:59:32", "2020.05.09 10:39:32", 
                             "2020.05.09 11:39:32", "2020.05.09 12:39:32", "2020.05.09 13:39:32", 
                             "2020.05.09 14:39:32", "2020.05.09 15:39:32", "2020.05.09 16:39:32", 
                             "2020.05.09 17:39:32", "2020.05.09 18:39:32"), id = c(12L, 12L, 
                                                                                   12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
                                                                                   12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
                                                                                   13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 
                                                                                   13L, 13L, 13L, 13L, 13L), x = c("7.55", "4.55", "4.55", "12", 
                                                                                                                   "12", "10", "10", "4.3", "", "", "4.3", "4.3", "4.3", "", "4.3", 
                                                                                                                   "12", "12", "12", "2", "12", "12", "", "8", "3", "3", "2", "2", 
                                                                                                                   "", "12", "10", "10", "4.3", "4.3", "4.3", "4.3", "4.3", "4.3", 
                                                                                                                   "4.3", "4.3", "12", "12", "12", "12", "12", "12", "12")), row.names = c(NA, 
                                                                                                                                                                                           46L), class = "data.frame")

Edited solution

library(tidyverse)
library(lubridate)

df %>% as_tibble() %>%
  transform(x = as.numeric(x), 
            date_time = as_datetime(date_time),
            id = as.character(id)) %>%
  mutate(d_n = ifelse(hour(date_time)>=7 & hour(date_time)<23, 'day', 'night'),
         Date = as.Date(date_time, format = "%Y.%m.%d %H:%M:%S"),
         valid_m = ifelse(x>=4 & x<= 10, 1, 0)) %>%
  mutate(valid_m = ifelse(is.na(valid_m), 0, valid_m)) %>% #valid measurements
  arrange(id, date_time) %>%
  group_by(id) %>%
  mutate(validm_d = as.numeric(lead(date_time)-date_time)) %>%
  filter(!is.na(validm_d)) %>%
  group_by(id, Date, d_n, valid_m) %>%
  summarise(x_tm = sum(validm_d)) %>%
  ungroup() %>%
  pivot_wider(names_from = d_n, values_from = x_tm, values_fill =0) %>%
  group_by(id, Date) %>%
  mutate(day_t = sum(day), night_t = sum(night)) %>% 
  filter(valid_m != 0) %>%
  group_by(id) %>%
  mutate(measurement = row_number()) %>%
  select(id, measurement, Date, x_4_10_day =day, x_4_10_total =day_t, 
         x_4_10_night =night, x_4_10_totaln = night_t)

desired_result

  id    measurement Date       x_4_10_day x_4_10_total x_4_10_night x_4_10_totaln
  <chr>       <int> <date>          <dbl>        <dbl>        <dbl>         <dbl>
1 12              1 2020-03-02         50           60           20            60
2 12              2 2020-03-03          0            0            5            85
3 13              1 2020-05-09        235          600            0             0

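In this solution I was not sure how long each recording lasts, so I took the duration as the gap to the next recording and dropped the last value of each measurement. You can adapt the code as needed. Strictly speaking, the last "day" recording ends at 23:00, so the result in the first row should be 17 seconds less than shown.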
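One possible way to handle the 17-second overshoot is to cut each duration off at the next 23:00:00 boundary. The following is a minimal base R sketch under that assumption, not a change to the pipeline above. The names `boundary`, `next_ts` and `secs` are illustrative, and timestamps are treated as UTC so that whole days are multiples of 86400 seconds.

```r
# Three consecutive recordings of id 12 around the day/night boundary
ts <- as.POSIXct(c("2020.03.02 22:50:17",
                   "2020.03.02 22:55:17",
                   "2020.03.02 23:00:17"),
                 format = "%Y.%m.%d %H:%M:%S", tz = "UTC")
t_num <- as.numeric(ts)

# Next 23:00:00 strictly after each timestamp: shift by one hour so the
# boundary sits at midnight, move to the start of the next day, shift back.
boundary <- (floor((t_num + 3600) / 86400) + 1) * 86400 - 3600

# Time of the following recording (base R stand-in for dplyr::lead())
next_ts <- c(t_num[-1], NA)

# Seconds covered by each recording, capped at the boundary
secs <- pmin(next_ts, boundary) - t_num
secs
```

Here the 22:50:17 recording covers the full 300 seconds, the 22:55:17 recording is capped at 283 seconds (300 minus 17), and the last recording has no successor, so its duration is NA.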