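Sum a column if the difference between consecutive rows meets a condition

Time: 2019-09-13 18:02:23

Tags: r group-by sum

This is a follow-up to the question asked in the post "Remove the first row from each group if the second row meets a condition".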

Here is a sample dataset:

library(dplyr)  # provides %>%, group_by() and mutate()

df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"),
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"),
                 Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = FALSE) %>%
  # days since the previous transaction within each Buyer/id group (NA for the first row)
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y"))))

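Where the difference between two consecutive rows is <= 5 days, I need to keep those records for which, per Buyer and id, the sum of Amount across the consecutive rows is > 5000. For example, buyer 'Sandy' with id '4' has two transactions, on '6/15/2018' and '6/20/2018', of 1849 and 4193. They are 5 days apart, and because the sum of the two amounts is > 5000, the output should contain these records. The same buyer 'Sandy' with id '4' has another set of transactions on '8/17/2018', '8/20/2018' and '8/23/2018', of 4256, 65 and 100, each at most 3 days apart, but the output should not contain these records because their sum is < 5000. The final output looks like this: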

| id |    Date    | Buyer | diff | Amount |
|----|:----------:|------:|------|--------|
| 9  | 11/29/2018 |  John | NA   | 959    |
| 9  | 11/29/2018 |  John | 0    | 1158   |
| 9  | 11/29/2018 |  John | 0    | 596    |
| 5  | 2/13/2019  | Maria | 76   | 922    |
| 5  | 2/13/2019  | Maria | 0    | 922    |
| 4  | 6/15/2018  | Sandy | -243 | 1849   |
| 4  | 6/20/2018  | Sandy | 5    | 4193   |
| 4  | 8/17/2018  | Sandy | 58   | 4256   |
| 4  | 8/20/2018  | Sandy | 3    | 65     |
| 4  | 8/23/2018  | Sandy | 3    | 100    |
| 20 | 12/25/2018 | Paul  | 124  | 313    |
| 20 | 12/25/2018 | Paul  | 0    | 99     |
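To make the rule concrete, here is a minimal base R sketch applied to Sandy's five rows only. It assumes that a "run" of consecutive rows is broken whenever the gap exceeds 5 days, and that a whole run is kept when its total exceeds 5000. The variable names (`gaps`, `run`, `run_totals`, `keep`) are illustrative and do not come from the question.

```r
# Sandy's transactions for id "4", copied from the sample data
dates   <- as.Date(c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018", "8/23/2018"),
                   format = "%m/%d/%Y")
amounts <- c(1849, 4193, 4256, 65, 100)

# Gap in days to the previous transaction (NA for the first row)
gaps <- c(NA, as.numeric(diff(dates)))

# Assumption: a new run starts when the gap is missing or larger than 5 days
run <- cumsum(is.na(gaps) | gaps > 5)

# Total amount per run
run_totals <- tapply(amounts, run, sum)

# Keep only rows that belong to a run whose total exceeds 5000
keep <- unname(run_totals[as.character(run)] > 5000)
data.frame(Date = dates, Amount = amounts, run = run)[keep, ]
```

Under these assumptions the first run (1849 + 4193 = 6042) is kept and the second (4256 + 65 + 100 = 4421) is dropped, which matches the behaviour described for Sandy.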

2 answers:

Answer 0: (score: 1)

df <- data.frame(id=c("9","9","9","5","5","4","4","4","4","4","20","20"),
                 Date=c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
                        "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018","12/25/2018","12/25/2018"), 
                 Buyer= c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy","Sandy","Sandy","Paul","Paul"), 
                 Amount= c("959","1158","596","922","922","1849","4193","4256","65","100","313","99"), stringsAsFactors = F) %>% 
  group_by(Buyer,id) %>% mutate(diffs = c(NA, diff(as.Date(Date, format = "%m/%d/%Y")))) 

Convert Date from character to Date, and Amount from character to numeric:

df$Date<-as.Date(df$Date, '%m/%d/%Y')
df$Amount<-as.numeric(df$Amount)
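The format string matters for this conversion: the sample dates carry four-digit years, which `%Y` matches, whereas `%y` expects a two-digit year. A quick check on one of the sample dates (the variable names are only for illustration):

```r
x <- "6/15/2018"

# %Y reads the full four-digit year
four_digit <- as.Date(x, format = "%m/%d/%Y")

# %y reads only the first two digits ("20") as a two-digit year
two_digit <- as.Date(x, format = "%m/%d/%y")

format(four_digit, "%Y")
format(two_digit, "%Y")
```

With `%y` the year comes out as 2020 rather than 2018, so date differences computed afterwards would still look plausible while the dates themselves are wrong.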

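Now I group the dataset by id, arrange it by Date, and create a rank within each id (for example, Sandy is ranked from 1 to 5 across the 5 days on which she made purchases). Then I define a new variable called ConsecutiveSum, which adds each row's Amount to the previous row's Amount (lag gives you the previous row). If there is no previous row, the ifelse statement forces ConsecutiveSum to 0. The next step is simply to apply the conditions: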

df %>%
  group_by(id) %>%
  arrange(Date) %>%
  mutate(rank = dense_rank(Date)) %>%
  mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
  filter((diffs <= 5 & ConsecutiveSum >= 5000) | (ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000))


# id    Date      Buyer Amount diffs  rank ConsecutiveSum
#   <chr> <chr>     <chr>  <dbl> <dbl> <int>          <dbl>
# 1 4     6/15/2018 Sandy   1849    NA     1              0
# 2 4     6/20/2018 Sandy   4193     5     2           6042

Answer 1: (score: 0)

I would use a combination of techniques available in the tidyverse:

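First create a grouping variable (new_id), then use it together with the original id to sum Amount within each group. We can then filter on the criterion that the summed Amount is > 5000, and use the result in a semi_join to filter the original data.

ids is a dataset that finds the total Amount by id and new_id and filters for Dollars > 5000. This gives you the id and new_id combinations that meet your criteria.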
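The code for this answer is not preserved here, so the following is only a sketch of how the described steps might look, not the answerer's original code. The names `new_id`, `ids` and `Dollars` come from the text above. The rule used to build `new_id` (start a new group when the gap exceeds 5 days) and the inclusion of Buyer in the grouping are assumptions.

```r
library(dplyr)

df <- data.frame(
  id     = c("9","9","9","5","5","4","4","4","4","4","20","20"),
  Date   = c("11/29/2018","11/29/2018","11/29/2018","2/13/2019","2/13/2019",
             "6/15/2018","6/20/2018","8/17/2018","8/20/2018","8/23/2018",
             "12/25/2018","12/25/2018"),
  Buyer  = c("John","John","John","Maria","Maria","Sandy","Sandy","Sandy",
             "Sandy","Sandy","Paul","Paul"),
  Amount = c("959","1158","596","922","922","1849","4193","4256","65","100",
             "313","99"),
  stringsAsFactors = FALSE
) %>%
  mutate(Date   = as.Date(Date, format = "%m/%d/%Y"),
         Amount = as.numeric(Amount)) %>%
  arrange(Buyer, id, Date) %>%
  group_by(Buyer, id) %>%
  # diffs: days since the previous row in the group (NA for the first row)
  # new_id: assumed rule, a new run starts when the gap is NA or above 5 days
  mutate(diffs  = c(NA, as.numeric(diff(Date))),
         new_id = cumsum(is.na(diffs) | diffs > 5)) %>%
  ungroup()

# Total Amount per Buyer/id/new_id, keeping only runs above 5000
ids <- df %>%
  group_by(Buyer, id, new_id) %>%
  summarise(Dollars = sum(Amount)) %>%
  ungroup() %>%
  filter(Dollars > 5000)

# Keep only the original rows that belong to a qualifying run
result <- semi_join(df, ids, by = c("Buyer", "id", "new_id"))
result
```

On the sample data this keeps only Sandy's two June transactions (1849 + 4193 = 6042). `semi_join` is used instead of an inner join so that no columns from `ids` are added to the result.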
