I have some data that looks like this:
# A tibble: 6 x 3
Time Date Weather
<chr> <date> <chr>
1 "7:00 " 2010-01-01 Passing clouds
2 "7:30 " 2010-01-01 Passing clouds
3 "8:00 " 2010-01-01 Passing clouds
4 "8:30 " 2010-01-01 Passing clouds
5 "9:00 " 2010-01-01 Partly sunny
6 "9:30 " 2010-01-01 Drizzle Partly sunny
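Each day has observations at roughly 30-minute intervals. I am trying to collapse this into a daily series and create some dummy variables, but not one for every 30-minute observation.
That is, when I create the dummy variables now, I end up with too many columns, which is why I want to collapse the data based on a condition. The condition is: keep a Weather value only if it occurs in 4 consecutive observations. In the six rows shown above, for example, Passing clouds has 4 consecutive observations, whereas Partly sunny and Drizzle Partly sunny do not.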
I currently have the following:
library(dplyr)            # %>%, group_by(), distinct(), summarise()
library(splitstackshape)  # cSplit_e()

df %>%
  group_by(Date) %>%
  arrange(Weather) %>%
  distinct(Weather) %>%
  summarise(text = paste(Weather, collapse = "_")) %>%
  cSplit_e(., split.col = "text", sep = "_", type = "character",
           mode = "binary", fixed = TRUE, fill = 0)
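But this does it for every unique value in the Weather column, which gives me too many columns. So I am trying to add a condition that keeps only the values with 4 or more consecutive observations.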
Data:
df <- structure(list(Time = c("7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ",
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30",
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ",
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:00 ",
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30",
"11:00", "11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ",
"2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ",
"6:00 ", "6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ",
"9:30 ", "10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ",
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30",
"1:00 ", "1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ",
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ",
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 ", "7:30 ",
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30", "11:00",
"11:30", "12:00", "12:30", "1:00 ", "1:30 ", "2:00 ", "2:30 ",
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ",
"6:30 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ",
"10:00", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ",
"10:00", "10:30", "11:00", "11:30", "12:00", "12:30", "1:00 ",
"1:30 ", "2:00 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ", "4:30 ",
"5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ", "7:00 ",
"7:30 ", "8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "10:30",
"11:00", "11:30", "12:00", "1:00 ", "1:30 ", "2:00 ", "2:30 ",
"3:00 ", "3:30 ", "4:00 ", "4:30 ", "5:00 ", "5:30 ", "6:00 ",
"6:30 ", "7:00 ", "7:00 ", "7:30 ", "8:00 ", "8:30 ", "9:00 ",
"9:30 ", "10:00", "10:30", "11:00", "11:30", "12:00", "12:30",
"1:00 ", "1:30 ", "2:05 ", "2:30 ", "3:00 ", "3:30 ", "4:00 ",
"4:30 ", "5:00 ", "5:30 ", "6:00 ", "6:30 ", "7:00 ", "7:30 ",
"8:00 ", "8:30 ", "9:00 ", "9:30 ", "10:00", "7:00 "), Date = structure(c(14610,
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610,
14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610, 14610,
14610, 14610, 14610, 14610, 14610, 14610, 14611, 14611, 14611,
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611,
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611,
14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611, 14611,
14611, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612,
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612,
14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612, 14612,
14612, 14612, 14612, 14612, 14612, 14613, 14613, 14613, 14613,
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613,
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613,
14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613, 14613,
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614,
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614,
14614, 14614, 14614, 14614, 14614, 14614, 14614, 14614, 14615,
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615,
14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615, 14615,
14615, 14615, 14615, 14615, 14615, 14616, 14616, 14616, 14616,
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616,
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616,
14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616, 14616,
14617), class = "Date"), Weather = c("Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Partly sunny", "Drizzle Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Drizzle Partly sunny", "Drizzle Partly sunny",
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Fog",
"Passing clouds", "Passing clouds", "Light fog", "Scattered clouds",
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Broken clouds", "Partly cloudy", "Partly cloudy", "Partly cloudy",
"Partly cloudy", "Passing clouds", "Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Rain Partly sunny", "Rain Partly sunny", "Rain Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Partly sunny", "Passing clouds", "Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Passing clouds",
"Passing clouds", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog",
"Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Drizzle Fog", "Fog",
"Fog", "Fog", "Fog", "Light rain Fog", "Light rain Fog", "Rain Fog",
"Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Rain Fog", "Fog",
"Partly sunny", "Broken clouds", "Broken clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Light rain Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Fog",
"Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Fog", "Partly sunny",
"Broken clouds", "Broken clouds", "Broken clouds", "Broken clouds",
"Broken clouds", "Broken clouds", "Broken clouds", "Partly sunny",
"Partly sunny", "Partly sunny", "Partly sunny", "Partly sunny",
"Scattered clouds", "Passing clouds", "Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds", "Partly cloudy",
"Broken clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds",
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds",
"Scattered clouds", "Broken clouds", "Broken clouds", "Broken clouds",
"Scattered clouds", "Scattered clouds", "Scattered clouds", "Scattered clouds",
"Scattered clouds", "Scattered clouds", "Passing clouds", "Passing clouds",
"Rain Low clouds", "Rain Low clouds", "Rain Low clouds", "Rain Low clouds",
"Light rain Mostly cloudy", "Light rain Mostly cloudy", "Light rain Mostly cloudy",
"Light rain Mostly cloudy", "Rain Low clouds", "Light rain Mostly cloudy",
"Light rain Mostly cloudy", "Rain Mostly cloudy", "Snow Mostly cloudy",
"Snow Mostly cloudy", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog",
"Snow Ice fog", "Snow Ice fog", "Snow Ice fog", "Snow Ice fog",
"Snow Ice fog", "Light snow Ice fog", "Light snow Ice fog", "Ice fog",
"Passing clouds", "Partly cloudy", "Passing clouds", "Passing clouds",
"Passing clouds", "Passing clouds", "Passing clouds")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -200L))
Answer 0 (score: 1)
Are you looking for something like this?
library(dplyr)

df_new <- df %>%
  group_by(Date) %>%
  # length of the run each row belongs to, computed within each day
  mutate(repeated = rep(rle(Weather)$lengths, rle(Weather)$lengths)) %>%
  filter(repeated >= 4)
df_new
#> # A tibble: 148 x 4
#> # Groups: Date [7]
#> Time Date Weather repeated
#> <chr> <date> <chr> <int>
#> 1 "7:00 " 2010-01-01 Passing clouds 4
#> 2 "7:30 " 2010-01-01 Passing clouds 4
#> 3 "8:00 " 2010-01-01 Passing clouds 4
#> 4 "8:30 " 2010-01-01 Passing clouds 4
#> 5 10:00 2010-01-01 Partly sunny 10
#> 6 10:30 2010-01-01 Partly sunny 10
#> 7 11:00 2010-01-01 Partly sunny 10
#> 8 11:30 2010-01-01 Partly sunny 10
#> 9 12:00 2010-01-01 Partly sunny 10
#> 10 12:30 2010-01-01 Partly sunny 10
#> # … with 138 more rows
df_new %>%
summarise(text = paste(unique(Weather), collapse = "_"))
#> # A tibble: 7 x 2
#> Date text
#> <date> <chr>
#> 1 2010-01-01 Passing clouds_Partly sunny_Scattered clouds
#> 2 2010-01-02 Scattered clouds_Partly sunny_Partly cloudy_Passing clouds
#> 3 2010-01-03 Passing clouds_Partly sunny
#> 4 2010-01-04 Drizzle Fog_Fog_Rain Fog_Passing clouds
#> 5 2010-01-05 Fog_Broken clouds_Partly sunny
#> 6 2010-01-06 Scattered clouds
#> 7 2010-01-07 Rain Low clouds_Light rain Mostly cloudy_Snow Ice fog_Passing…
Created on 2019-11-25 by the reprex package (v0.3.0)
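The question ultimately wants one row per day with 0/1 dummy columns. The cSplit_e call from the question could presumably be applied to the summarised text column as before. As an alternative, here is a minimal base R sketch of that last step using table(). The kept data frame is a made-up stand-in for df_new, not the real data, and this is only one possible way to build the indicators.

```r
# Hypothetical stand-in for df_new: rows that survived the run-length filter.
# Dates and Weather values are invented for illustration.
kept <- data.frame(
  Date = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                   "2010-01-02", "2010-01-02")),
  Weather = c("Passing clouds", "Passing clouds", "Partly sunny",
              "Fog", "Passing clouds")
)

# Cross-tabulate Date by Weather; each cell counts the surviving rows
counts <- table(kept$Date, kept$Weather)

# Turn counts into 0/1 indicators: one row per date, one column per Weather value
dummies <- as.data.frame.matrix(+(counts > 0))
dummies
```

Here the dates end up as row names rather than as a Date column, so they would need to be moved back into a column if later steps rely on it.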
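rle counts runs of repeated consecutive values. I wrapped it in rep so that the result has one entry per row and fits into a mutate call, but you can also run it on its own to get a feel for how it works (in case you don't know it already). Once you know how many times each value is repeated in a row, it is easy to first filter and then summarise.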
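To see what the rle and rep combination does on its own, here is a small sketch on a toy vector. The weather values are invented for illustration and stand in for a single day's Weather column.

```r
# Toy vector standing in for one day's Weather column (made-up values)
weather <- c("Fog", "Fog", "Fog", "Fog", "Sunny", "Rain", "Rain")

# rle() splits the vector into runs of identical consecutive values
runs <- rle(weather)
runs$values   # "Fog" "Sunny" "Rain"
runs$lengths  # 4 1 2

# Repeat each run length once per element of its run, so the result
# has the same length as the input and can be used inside mutate()
repeated <- rep(runs$lengths, runs$lengths)
repeated      # 4 4 4 4 1 2 2

# Keep only the values that belong to a run of 4 or more
unique(weather[repeated >= 4])  # "Fog"
```

In the grouped pipeline the same computation runs once per Date, so a run never spans two days.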