This is a follow-up to the post "Remove the first row from each group if the second row meets a condition".
Here is the sample dataset:
library(dplyr)

# Sample data: Date and Amount start out as character columns
df <- data.frame(id = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
    "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
  Buyer = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
  Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"),
  stringsAsFactors = FALSE) %>%
  group_by(Buyer, id) %>%
  # Days since the previous transaction within each Buyer/id group
  mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))
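If the difference between two consecutive rows is <= 5 days, I need to keep those records where the sum of the amounts of the consecutive rows is > 5000, per buyer and id. For example, buyer 'Sandy' with id '4' has two transactions, of 1849 and 4193, on '6/15/2018' and '6/20/2018', which is a 5-day interval. Since the sum of these two amounts is > 5000, the output should contain those records. The same buyer 'Sandy' with id '4' has further transactions of 4256, 65 and 100 on '8/17/2018', '8/20/2018' and '8/23/2018', at most 3 days apart, but the output should not contain these records because the sum of these amounts is < 5000. The final output will look like: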
| id | Date | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9 | 11/29/2018 | John | NA | 959 |
| 9 | 11/29/2018 | John | 0 | 1158 |
| 9 | 11/29/2018 | John | 0 | 596 |
| 5 | 2/13/2019 | Maria | 76 | 922 |
| 5 | 2/13/2019 | Maria | 0 | 922 |
| 4 | 6/15/2018 | Sandy | -243 | 1849 |
| 4 | 6/20/2018 | Sandy | 5 | 4193 |
| 4 | 8/17/2018 | Sandy | 58 | 4256 |
| 4 | 8/20/2018 | Sandy | 3 | 65 |
| 4 | 8/23/2018 | Sandy | 3 | 100 |
| 20 | 12/25/2018 | Paul | 124 | 313 |
| 20 | 12/25/2018 | Paul | 0 | 99 |
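The two Sandy cases described in the question can be checked by hand. The short base R sketch below only reproduces that arithmetic; the variable names are made up for illustration.

```r
# Sandy, id 4: two transactions on 6/15/2018 and 6/20/2018
gap_days   <- as.numeric(as.Date("6/20/2018", format = "%m/%d/%Y") -
                         as.Date("6/15/2018", format = "%m/%d/%Y"))
pair_total <- sum(c(1849, 4193))

# Sandy, id 4: three transactions on 8/17, 8/20 and 8/23/2018
run_total <- sum(c(4256, 65, 100))

gap_days <= 5       # TRUE: the pair is 5 days apart
pair_total > 5000   # TRUE: 6042, so the pair is kept
run_total > 5000    # FALSE: 4421, so these rows are dropped
```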
Answer 0 (score: 1)
Convert Date from character to Date, and Amount from character to numeric:
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
df$Amount <- as.numeric(df$Amount)
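The dates in this dataset have four-digit years, so the format string matters here. As a small illustration (the variable names are hypothetical), `%Y` reads the full year, while `%y` reads only the first two digits and silently ignores the rest:

```r
four_digit <- as.Date("11/29/2018", format = "%m/%d/%Y")
two_digit  <- as.Date("11/29/2018", format = "%m/%d/%y")

format(four_digit)  # "2018-11-29"
format(two_digit)   # "2020-11-29": only "20" is read as the year
```

With `%y` every date would land in 2020, and the ordering across years (for example 12/25/2018 versus 2/13/2019) would be wrong.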
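Now I group the dataset by id, arrange it by date, and create a rank within each id (for example, Sandy gets ranks 1 through 5 for the 5 days on which she made purchases). Then I define a new variable called `ConsecutiveSum`, which adds each row's `Amount` to the previous row's `Amount` (`lag` gives you the previous row). If there is no previous row, the `ifelse` statement forces `ConsecutiveSum` to 0. The last step is to apply the conditions: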
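df %>%
  group_by(id) %>%
  arrange(Date) %>%
  # Rank of each distinct purchase date within the id
  mutate(rank = dense_rank(Date)) %>%
  # Amount plus the previous row's Amount; 0 on the first row of each id
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  # Keep the second row of a qualifying pair, plus a first row whose next row has a qualifying sum
  filter((diffs <= 5 & ConsecutiveSum >= 5000) | (ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000))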
# id Date Buyer Amount diffs rank ConsecutiveSum
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl>
# 1 4 6/15/2018 Sandy 1849 NA 1 0
# 2 4 6/20/2018 Sandy 4193 5 2 6042
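The filter above works on pairs of consecutive rows. As a hedged variation on the same idea, the sketch below flags a row when it closes a qualifying pair (gap <= 5 days and pair sum > 5000) and also flags the row just before it, so the gap is checked for both rows of the pair. The data is a cut-down copy of Sandy's rows, and the column names `gap`, `pair_sum`, `closes_pair` and `opens_pair` are hypothetical.

```r
library(dplyr)

tx <- data.frame(
  id     = "4",
  Buyer  = "Sandy",
  Date   = as.Date(c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018", "8/23/2018"),
                   format = "%m/%d/%Y"),
  Amount = c(1849, 4193, 4256, 65, 100)
)

flagged <- tx %>%
  group_by(Buyer, id) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    # Days since the previous row in the group (NA on the first row)
    gap         = as.numeric(difftime(Date, lag(Date), units = "days")),
    # This row's Amount plus the previous row's Amount (NA on the first row)
    pair_sum    = Amount + lag(Amount),
    # TRUE when this row and the previous one form a qualifying pair
    closes_pair = !is.na(gap) & gap <= 5 & pair_sum > 5000,
    # TRUE when the next row closes a qualifying pair
    opens_pair  = lead(closes_pair, default = FALSE)
  ) %>%
  ungroup()

result <- filter(flagged, closes_pair | opens_pair)
result$Amount  # 1849 4193
```

Like the answer's code, this only looks at two rows at a time; runs of three or more close transactions are not summed together.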
Answer 1 (score: 0)
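I would use a combination of techniques available in the `tidyverse`:
First create a grouping variable (`new_id`), then use it together with the original `id` to sum the amounts within each grouping. We can then `filter` on the criterion that the sum of `Amount` is > 5000, and apply that filter to the data with a `join` or `semi_join`.
`ids` is a dataset that finds the total `Amount` by `id` and `new_id` and `filter`s for `Dollars > 5000`. This gives you the `id` and `new_id` combinations that meet your criteria.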
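A minimal sketch of the steps described in this answer might look like the following. It is a reconstruction under assumptions, not the answer's own code: the rule for building `new_id` (start a new run whenever the gap to the previous transaction exceeds 5 days) and the helper column `gap` are guesses, while `ids`, `new_id` and `Dollars` follow the names used in the text.

```r
library(dplyr)

df <- data.frame(
  id     = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date   = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
             "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
             "12/25/2018","12/25/2018"),
  Buyer  = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy",
             "Sandy","Sandy","Paul","Paul"),
  Amount = c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"),
  stringsAsFactors = FALSE
) %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"),
         Amount = as.numeric(Amount))

# Assumption: a new run starts at the first row of a Buyer/id group
# and whenever more than 5 days have passed since the previous row.
grouped <- df %>%
  group_by(Buyer, id) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(gap    = as.numeric(difftime(Date, lag(Date), units = "days")),
         new_id = cumsum(is.na(gap) | gap > 5)) %>%
  ungroup()

# Total per id/new_id, keeping only the runs above the threshold
ids <- grouped %>%
  group_by(id, new_id) %>%
  summarise(Dollars = sum(Amount), .groups = "drop") %>%
  filter(Dollars > 5000)

# Keep only the original rows that belong to a qualifying run
result <- semi_join(grouped, ids, by = c("id", "new_id"))
result$Amount  # 1849 4193
```

`semi_join` keeps the rows of `grouped` that have a match in `ids` without adding the `Dollars` column. Note that a run consisting of a single transaction above 5000 would also pass this filter; adding a row count to `ids` and requiring more than one row would exclude it.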