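How do I aggregate dates by day difference and average distance?

Time: 2016-12-02 23:02:16

Tags: r sqlite sqldf

I have a database of cash register transactions. Each record is one product within a basket: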

     Date    Hour  Cust  Prod Basket Spend
1| 20160416    8    C1    P1    B2     10
2| 20160416    8    C1    P2    B2     20
3| 20160115   15    C1    P3    B1     30
4| 20160115   15    C1    P2    B1     50
5| 20161023   11    C1    P4    B3     60

I would like to see:

DaysSinceLastVisit  Cust Basket Spend
      NULL           C1    B1     80
        92           C1    B2     30
       190           C1    B3     60

AvgDaysBetweenVisits Cust AvgSpent
          141         C1    56.67

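I can't figure out how to apply an aggregate function to the dates during a GROUP BY. All the other posts on SO seem to involve two start/end dates [1] [2] [3].

Here is what I have tried so far: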

SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY Cust, Basket  # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY Cust, Basket # DIFF/DIFFERENCE is not a function

Also, note that I am running this in R using sqldf, which uses SQLite syntax. However, I would prefer a SQLite solution.

2 Answers:

Answer 0: (score: 1)

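days_since_last_visit is measured relative to today's date and time, because that is more practical. However, if you take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. The best way to handle that part is a cursor; it can also be done in a query, but that would be more complex.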

    -- days from each basket's visit date to now, plus total spend per basket
    select round(julianday('now')
                 - min(julianday(substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7))), 2) days_since_last_visit,
           date, cust, basket, sum(spend) total_spend
      from customer
     group by cust, basket, date
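
The in-query alternative to a cursor mentioned above could look roughly like the following. This is only a sketch: it assumes the `customer` table from the create script in this answer, loads the question's five rows into an in-memory database, and uses Python's standard `sqlite3` module purely as a convenient way to run the SQL. The CTE name `visits` and the aliases are made up for illustration. A correlated subquery looks up the latest earlier visit of the same customer.

```python
import sqlite3

# In-memory database with the question's rows (table layout taken from this answer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table customer (date text, hour integer, cust text, "
    "prod text, basket text, spend integer)"
)
conn.executemany(
    "insert into customer values (?, ?, ?, ?, ?, ?)",
    [
        ("20160416", 8, "C1", "P1", "B2", 10),
        ("20160416", 8, "C1", "P2", "B2", 20),
        ("20160115", 15, "C1", "P3", "B1", 30),
        ("20160115", 15, "C1", "P2", "B1", 50),
        ("20161023", 11, "C1", "P4", "B3", 60),
    ],
)

query = """
with visits as (
    -- one row per basket; the YYYYMMDD text becomes a Julian day number
    select cust, basket, date,
           julianday(substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)) as jd,
           sum(spend) as spend
      from customer
     group by cust, basket, date
)
select v.jd - (select max(p.jd)          -- latest earlier visit, NULL if none
                 from visits p
                where p.cust = v.cust and p.jd < v.jd) as days_since_last_visit,
       v.cust, v.basket, v.spend
  from visits v
 order by v.cust, v.jd
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data this should give NULL, 92 and 190 days for baskets B1, B2 and B3, with spends of 80, 30 and 60.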

The average number of days between each visit date and today, together with the average spend, per customer:

    -- outer query averages the per-basket results of the inner query
    select round(avg(julian_days), 2) average_days, cust, round(avg(total_spend), 2) average_spent
      from
           ( select julianday('now')
                    - min(julianday(substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7))) julian_days,
                    date, cust, basket, sum(spend) total_spend
               from customer
              group by cust, basket, date )
     group by cust
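
If the goal is instead the average gap between consecutive visits (the 141 in the question), one possible shortcut is that consecutive gaps add up to the span between the first and last visit, so the average is that span divided by the number of visits minus one. The sketch below assumes the same `customer` table and sample rows, assumes every basket is a separate visit on its own date, and again uses Python's `sqlite3` module only as a runner. The CTE name `visits` is illustrative.

```python
import sqlite3

# In-memory database with the question's rows (table layout taken from this answer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table customer (date text, hour integer, cust text, "
    "prod text, basket text, spend integer)"
)
conn.executemany(
    "insert into customer values (?, ?, ?, ?, ?, ?)",
    [
        ("20160416", 8, "C1", "P1", "B2", 10),
        ("20160416", 8, "C1", "P2", "B2", 20),
        ("20160115", 15, "C1", "P3", "B1", 30),
        ("20160115", 15, "C1", "P2", "B1", 50),
        ("20161023", 11, "C1", "P4", "B3", 60),
    ],
)

query = """
with visits as (
    -- one row per basket with its Julian day number and total spend
    select cust, basket,
           julianday(substr(date, 1, 4) || '-' || substr(date, 5, 2) || '-' || substr(date, 7, 2)) as jd,
           sum(spend) as spend
      from customer
     group by cust, basket, date
)
select (max(jd) - min(jd)) / (count(*) - 1) as avg_days_between_visits,  -- NULL for a single visit
       cust,
       round(avg(spend), 2) as avg_spent
  from visits
 group by cust
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

For customer C1 this works out to (92 + 190) / 2 = 141 days and an average basket spend of about 56.67. A customer with a single visit gets NULL, because SQLite returns NULL on division by zero.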

For reference only, the create and insert scripts:

    create table customer (date text, hour integer, cust text, prod text, basket text, spend integer);

    insert into customer (date, hour, cust, prod, basket, spend) values ('20161023', 11, 'C1', 'P4', 'B3', 60);

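Answer 1: (score: 0)

This uses SQLite via sqldf, as in the question.

We first define three tables in a WITH clause (they exist only for the duration of the SQL statement):

  1. aa is table a with an extra Julian date column suitable for differencing
  2. tab_days is a table that uses aa, via a suitably aggregated self-join, to define the difference in days
  3. tab_sum_spend is a table holding the sum of Spend

Finally, we join the last two and sort appropriately.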

    library(sqldf) 
    # see note at end for a in reproducible form
    
    t1 <- sqldf("
    WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
                                 substr(Date, 5, 2) || '-' ||
                                 substr(Date, 7, 2)) juldate, 
                       * 
                FROM a),     
         tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
                      FROM   aa a1
                              LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
                      GROUP  BY a1.Cust, a1.Date, a1.Basket),
         tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
                           FROM   aa
                           GROUP  BY Cust, Date, Basket) 
    SELECT Days, Cust, Basket, Spend
    FROM tab_days
    JOIN tab_sum_spend USING(Cust, Date, Basket)
    ORDER  BY Cust, Date, Basket
    ")
    t1
    
    ##    Days Cust Basket Spend
    ## 1  <NA>   C1     B1    80
    ## 2  92.0   C1     B2    30
    ## 3 190.0   C1     B3    60
    

And for the second question:

    sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")
    ##   AvgDays Cust AvgSpend
    ## 1     141   C1 56.66667
    

Note: the data.frame a in reproducible form is:

    Lines <- "Date Hour Cust Prod Basket Spend
    1 20160416    8   C1   P1     B2    10
    2 20160416    8   C1   P2     B2    20
    3 20160115   15   C1   P3     B1    30
    4 20160115   15   C1   P2     B1    50
    5 20161023   11   C1   P4     B3    60"
    a <- read.table(text = Lines, as.is = TRUE)