根据字符串条件将每小时数据分解为每日数据并创建虚拟变量

时间:2019-11-25 14:47:34

标签: r

我有一些看起来像这样的数据

# A tibble: 6 x 3
  Time    Date       Weather             
  <chr>   <date>     <chr>               
1 "7:00 " 2010-01-01 Passing clouds      
2 "7:30 " 2010-01-01 Passing clouds      
3 "8:00 " 2010-01-01 Passing clouds      
4 "8:30 " 2010-01-01 Passing clouds      
5 "9:00 " 2010-01-01 Partly sunny        
6 "9:30 " 2010-01-01 Drizzle Partly sunny

每天都有每小时的数据。我正在尝试将其分解为每日系列,并创建一些虚拟变量,但不要每隔30分钟间隔一次。

也就是说,当我当前创建虚拟变量时,它创建了太多列。这就是为什么我试图根据某种条件将其折叠的原因。条件是如果Weather具有4个连续的观测值,则保持不变。即Passing clouds具有4个连续的Weather条件,但Partly sunny没有,Drizzle Partly sunny也是如此。

我目前有以下内容:

library(splitstackshape)
df %>% 
  group_by(Date) %>% 
  arrange(Weather) %>% 
  distinct(Weather) %>% 
  summarise(text = paste(Weather, collapse = "_")) %>% 
  cSplit_e(., split.col = "text", sep = "_", type = "character",
           mode = "binary", fixed = TRUE, fill = 0)

但这对Weather列中的所有唯一观察结果均是如此,这给了我太多列。因此,我试图添加一个条件,以仅保留具有4个或更多连续观察值的列。

数据:

df <- structure(list(Time = c("7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", 
"2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", 
"6:00 ", "6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", "11:00", 
"11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", 
"10:00", "10:30", "11:00", "11:30", "12:00", "12:30", "1:00 ", 
"1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", 
"5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", "7:00 ", 
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", 
"11:00", "11:30", "12:00", "1:00 ", "1:30 ", "2:00 ", "2:30 ", 
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ", 
"6:30 ", "7:00 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", 
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30", 
"1:00 ", "1:30 ", "2:05 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", 
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", 
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 "), Date = structure(c(14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 
14610, 14610, 14610, 14610, 14610, 14610, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 
14611, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 
14612, 14612, 14612, 14612, 14612, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 
14615, 14615, 14615, 14615, 14615, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 
14617), class = "Date"), Weather = c("Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Drizzle Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Drizzle Partly sunny", "Drizzle Partly sunny", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Passing clouds", "Passing clouds", "Light fog", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Broken clouds", "Partly cloudy", "Partly cloudy", "Partly cloudy", 
"Partly cloudy", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Rain Partly sunny", "Rain Partly sunny", "Rain Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Partly sunny", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", 
"Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Fog", 
"Fog", "Fog", "Fog", "Light rain Fog", "Light rain Fog", "Rain Fog", 
"Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Fog", 
"Partly sunny", "Broken clouds", "Broken clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Light rain Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Fog", 
"Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Partly sunny", 
"Broken clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Broken clouds", "Broken clouds", "Broken clouds", "Partly sunny", 
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny", 
"Scattered clouds", "Passing clouds", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds", "Partly cloudy", 
"Broken clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Broken clouds", "Broken clouds", "Broken clouds", 
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds", 
"Scattered clouds", "Scattered clouds", "Passing clouds", "Passing clouds", 
"Rain Low clouds", "Rain Low clouds", "Rain Low clouds", "Rain Low clouds", 
"Light rain Mostly cloudy", "Light rain Mostly cloudy", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Low clouds", "Light rain Mostly cloudy", 
"Light rain Mostly cloudy", "Rain Mostly cloudy", "Snow Mostly cloudy", 
"Snow Mostly cloudy", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog", 
"Snow Ice fog", "Light snow Ice fog", "Light snow Ice fog", "Ice fog", 
"Passing clouds", "Partly cloudy", "Passing clouds", "Passing clouds", 
"Passing clouds", "Passing clouds", "Passing clouds")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -200L))

1 个答案:

答案 0 :(得分:1)

您在寻找这样的东西吗?

library(dplyr)df_new <- df %>% 
  group_by(Date) %>% 
  mutate(repeated = rep(rle(Weather)$lengths, rle(Weather)$lengths)) %>% 
  filter(repeated >= 4)

df_new
#> # A tibble: 148 x 4
#> # Groups:   Date [7]
#>    Time    Date       Weather        repeated
#>    <chr>   <date>     <chr>             <int>
#>  1 "7:00 " 2010-01-01 Passing clouds        4
#>  2 "7:30 " 2010-01-01 Passing clouds        4
#>  3 "8:00 " 2010-01-01 Passing clouds        4
#>  4 "8:30 " 2010-01-01 Passing clouds        4
#>  5 10:00   2010-01-01 Partly sunny         10
#>  6 10:30   2010-01-01 Partly sunny         10
#>  7 11:00   2010-01-01 Partly sunny         10
#>  8 11:30   2010-01-01 Partly sunny         10
#>  9 12:00   2010-01-01 Partly sunny         10
#> 10 12:30   2010-01-01 Partly sunny         10
#> # … with 138 more rows

df_new %>% 
  summarise(text = paste(unique(Weather), collapse = "_")) 
#> # A tibble: 7 x 2
#>   Date       text                                                          
#>   <date>     <chr>                                                         
#> 1 2010-01-01 Passing clouds_Partly sunny_Scattered clouds                  
#> 2 2010-01-02 Scattered clouds_Partly sunny_Partly cloudy_Passing clouds    
#> 3 2010-01-03 Passing clouds_Partly sunny                                   
#> 4 2010-01-04 Drizzle Fog_Fog_Rain Fog_Passing clouds                       
#> 5 2010-01-05 Fog_Broken clouds_Partly sunny                                
#> 6 2010-01-06 Scattered clouds                                              
#> 7 2010-01-07 Rain Low clouds_Light rain Mostly cloudy_Snow Ice fog_Passing…

reprex package(v0.3.0)于2019-11-25创建

rle对重复的连续值进行计数。我将其包装在rep中,以使其适合mutate调用,但是您也可以单独运行它以掌握其工作原理(如果尚未使用的话)知道)。一旦您知道每个值重复的频率,就很容易先进行filter然后是summarise