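I have a dataset with date, sequence and low columns, among other columns; see df below.
The 1-to-9 sequence in the sequence column is treated as a block, or one complete cycle.
The dataset has several such complete blocks/cycles as well as partially completed ones (eg: 1-to-4).
This is what I am trying to solve:
If there are two lows with the same value but on different dates, then it should output only the latest date (see the third block in the output, df1).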
library(lubridate)
library(tidyverse)
### Sample data
df <- data.frame(stringsAsFactors=FALSE,
date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
"5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
"9/01/2019", "10/01/2019", "11/01/2019", "12/01/2019", "13/01/2019",
"14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019", "18/01/2019",
"19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
"23/01/2019", "24/01/2019", "25/01/2019", "26/01/2019", "27/01/2019",
"28/01/2019", "29/01/2019", "30/01/2019", "31/01/2019",
"1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8,
9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9),
low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49,
16, 79, 2, 74, 8, 88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9,
41, 9, 50)) %>% mutate(date = dmy(date))
The final output I want:
df1 <- data.frame(stringsAsFactors=FALSE,
date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
"5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
"9/01/2019", "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019",
"18/01/2019", "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
"27/01/2019", "28/01/2019", "29/01/2019", "30/01/2019",
"31/01/2019", "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3,
4, 5, 6, 7, 8, 9),
low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 1, 7, 49, 16, 79, 2, 74,
8, 88, 79, 51, 52, 47, 42, 9, 41, 9, 50),
group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3)) %>% mutate(date = dmy(date))
Any ideas?
P.S. I had some trouble formatting this question, so it is rather inconsistent.
Answer 0 (score: 2)
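We create a grouping variable by taking the cumulative sum of sequence being 1, then filter to keep only the groups that have 9 elements, and slice the row where low is the minimum. Before slicing, we arrange date in descending (desc) order, which takes care of cases where there are ties for the lowest value:
df %>%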
group_by(group = cumsum(sequence == 1)) %>%
filter(n() == 9) %>%
select(date, low) %>%
arrange(desc(date)) %>%
slice(which.min(low)) %>%
ungroup %>%
select(-group)
# A tibble: 3 x 2
# date low
# <date> <dbl>
#1 2019-01-04 18
#2 2019-01-14 1
#3 2019-02-03 9
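To make the grouping step concrete, here is a small base R sketch that uses only the sequence values from the sample data. It shows what cumsum(sequence == 1) produces and how the group sizes separate complete blocks from partial ones. The names sequence_values, block_id, block_sizes and complete_blocks are illustrative and not part of the answer above.

```r
# Sequence column from the sample data: three full 1-to-9 blocks
# separated by two partial 1-to-4 blocks.
sequence_values <- c(1:9, 1:4, 1:9, 1:4, 1:9)

# Each time the sequence restarts at 1 the running total goes up by one,
# so every row in the same block shares one id.
block_id <- cumsum(sequence_values == 1)

# Rows per block: 9 for a complete block, 4 for a partial one.
block_sizes <- table(block_id)
print(block_sizes)

# Ids of the blocks that a filter on "exactly 9 rows" would keep.
complete_blocks <- as.integer(names(block_sizes)[block_sizes == 9])
print(complete_blocks)
```

The surviving ids are 1, 3 and 5 rather than 1, 2 and 3, so matching the group column shown in df1 would need a renumbering step.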
Or a similar option with data.table
Answer 1 (score: 2)
Another dplyr possibility could be:
df %>%
group_by(group = cumsum(sequence == 1), rleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
filter(all(c(1:9) %in% sequence)) %>%
slice(which.min(rank(low, ties.method = "last"))) %>%
ungroup() %>%
select(-group, -rleid)
date sequence low
<date> <dbl> <dbl>
1 2019-01-04 4 18
2 2019-01-14 1 1
3 2019-02-03 8 9
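Here, it first creates a cumulative sum of sequence == 1 and an rleid()-like variable based on that cumulative sum, then groups by both. Second, it removes the cases where the sequence does not contain all nine values. Finally, it returns the minimum value per group; in case of ties it returns the last minimum (you can change this through the ties.method argument).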
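The tie handling can be seen in isolation with a short base R sketch. It uses the low values of the last complete block in the sample data, where the minimum 9 occurs twice. The names first_min and last_min are illustrative only.

```r
# low values of the last complete block in the sample data;
# the minimum 9 appears at positions 6 and 8.
low <- c(79, 51, 52, 47, 42, 9, 41, 9, 50)

# which.min() alone returns the first position of the minimum.
first_min <- which.min(low)

# With ties.method = "last", the later of two tied values gets the
# smaller rank, so which.min() of the ranks points at the last minimum.
last_min <- which.min(rank(low, ties.method = "last"))

print(c(first = first_min, last = last_min))
```

Because the rows are in date order, the last minimum is also the one with the latest date, which is what the question asks for.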
Answer 2 (score: 1)
This is also possible in base R, though it may be a bit Map-heavy.
w <- which(df$sequence == 1)
w <- w[sapply(w, function(x) df$sequence[x + 8] == 9 & sum(df$sequence[x:(x + 8)]) == 45)]
do.call(rbind, Map(function(x) x[which.min(x$low), ],
Map(function(s) df[s, ], Map(seq, w, l=9))))
# date sequence low
# 4 2019-01-04 4 18
# 14 2019-01-14 1 1
# 32 2019-02-01 6 9
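The trick is to find the completed sequences and group them in a list, then rbind the which.min row of each group. The sum(.) == 45 check is there to guard against malformed sequences; if the data never actually contains any, it may be unnecessary.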
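Note that the output above returns 2019-02-01 for the last block, that is, the first of the two tied lows. As a rough sketch of one way to return the latest date instead, the variant below picks the last row among the tied minima. It assumes the rows within each block are already sorted by date, and it swaps the sum check for a direct comparison against 1:9. The names starts, is_complete, pick_latest_min and result are hypothetical and not part of the answer.

```r
# Rebuild the sample data compactly (same values as df in the question).
df <- data.frame(
  date = as.Date("2019-01-01") + 0:34,
  sequence = c(1:9, 1:4, 1:9, 1:4, 1:9),
  low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49,
          16, 79, 2, 74, 8, 88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9,
          41, 9, 50)
)

# Start rows of complete 1-to-9 runs. isTRUE() also rejects the NA
# that appears when a run would extend past the end of the data.
starts <- which(df$sequence == 1)
is_complete <- vapply(
  starts,
  function(i) isTRUE(all(df$sequence[i:(i + 8)] == 1:9)),
  logical(1)
)
starts <- starts[is_complete]

# Within each run, keep the row with the minimum low. On ties, take the
# last matching row, which is the latest date under the sorting assumption.
pick_latest_min <- function(i) {
  block <- df[i:(i + 8), ]
  block[max(which(block$low == min(block$low))), ]
}

result <- do.call(rbind, lapply(starts, pick_latest_min))
print(result)
```

With the sample data this should select rows 4, 14 and 34, so the last block reports 2019-02-03 rather than 2019-02-01.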
Data
df <- structure(list(date = structure(c(17897, 17898, 17899, 17900,
17901, 17902, 17903, 17904, 17905, 17906, 17907, 17908, 17909,
17910, 17911, 17912, 17913, 17914, 17915, 17916, 17917, 17918,
17919, 17920, 17921, 17922, 17923, 17924, 17925, 17926, 17927,
17928, 17929, 17930, 17931), class = "Date"), sequence = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9), low = c(96, 81, 43, 18,
43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49, 16, 79, 2, 74, 8,
88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9, 41, 9, 50)), row.names = c(NA,
-35L), class = "data.frame")