I have a database of cash-register transactions. The records are itemized by product and grouped into baskets:
Date Hour Cust Prod Basket Spend
1| 20160416 8 C1 P1 B2 10
2| 20160416 8 C1 P2 B2 20
3| 20160115 15 C1 P3 B1 30
4| 20160115 15 C1 P2 B1 50
5| 20161023 11 C1 P4 B3 60
I would like to see:
DaysSinceLastVisit Cust Basket Spend
NULL C1 B1 30
92 C1 B2 80
190 C1 B3 60
and
AvgDaysBetweenVisits Cust AvgSpent
141 C1 56.57
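I can't figure out how to apply an aggregate function to the dates within a GROUP BY. All the other posts on SO seem to deal with two start/end dates [1] [2] [3].
Here is what I have tried so far: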
SELECT SUM(DATE(Date)), Cust, Basket, SUM(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # Sums the numeric values
SELECT DIFF(DATE(Date)), Cust, Basket, AVG(Spend) FROM 'a' GROUP BY CUST_CODE,BASKET # DIFF/DIFFERENCE not a function
Also, I should note that I am running this in R using sqldf, which uses SQLite syntax, so I would prefer a SQLite solution.
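Answer 0: (score: 1)
Here days_since_last_visit is measured against today's date and time, since that is more practical. If you instead take the difference between the 1st and 2nd visits and between the 2nd and 3rd, you get 92 and 190, which matches your data. The best way to handle that part is a cursor; it can also be done in a query, but that is more complicated.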
select round(julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))), 2) days_since_last_visit,
       date, cust, basket, sum(spend) total_spend
from customer
group by cust, basket, date
Average, per customer, of the days between each visit date and today's date, along with the average spend:
select round(avg(julian_days), 2) average_days, cust, round(avg(total_spend), 2) average_spent
from
  ( select julianday('now') - min(julianday(substr(date,1,4) || '-' || substr(date,5,2) || '-' || substr(date,7))) julian_days, date,
           cust, basket, sum(spend) total_spend
    from customer
    group by cust, basket, date )
group by cust
For reference only, the create and insert scripts:
create table customer ( date text, hour integer, cust text, prod text, basket text, spend integer );
insert into customer ( date, hour, cust, prod, basket, spend ) values ( '20161023', 11, 'C1', 'P4', 'B3', 60 );
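The consecutive-visit gaps mentioned above (92 and 190) can also be produced in a single query, without a cursor. The following is only a sketch, assuming the customer table from this answer is loaded with the five rows from the question. It is run through Python's built-in sqlite3 module purely for convenience, and the names visits, jd and visit_gaps are made up for this illustration.

```python
import sqlite3

# The five rows from the question, in the column order of the customer table.
ROWS = [
    ("20160416", 8, "C1", "P1", "B2", 10),
    ("20160416", 8, "C1", "P2", "B2", 20),
    ("20160115", 15, "C1", "P3", "B1", 30),
    ("20160115", 15, "C1", "P2", "B1", 50),
    ("20161023", 11, "C1", "P4", "B3", 60),
]

# visits: one row per (cust, basket, date) with its Julian day (jd).
# The correlated subquery finds the latest earlier visit of the same customer;
# for a customer's first visit it yields NULL, so the difference is NULL too.
QUERY = """
WITH visits AS (
    SELECT cust, basket, date,
           julianday(substr(date, 1, 4) || '-' ||
                     substr(date, 5, 2) || '-' ||
                     substr(date, 7, 2)) AS jd,
           sum(spend) AS total_spend
    FROM customer
    GROUP BY cust, basket, date
)
SELECT v.jd - (SELECT max(p.jd)
               FROM visits p
               WHERE p.cust = v.cust AND p.jd < v.jd) AS days_since_last_visit,
       v.cust, v.basket, v.total_spend
FROM visits v
ORDER BY v.cust, v.jd
"""

def visit_gaps():
    """Return (days_since_last_visit, cust, basket, total_spend) per visit."""
    con = sqlite3.connect(":memory:")
    con.execute("create table customer (date text, hour integer, cust text, "
                "prod text, basket text, spend integer)")
    con.executemany("insert into customer values (?, ?, ?, ?, ?, ?)", ROWS)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    for row in visit_gaps():
        print(row)
```

The subquery compares Julian days rather than the raw date text, so the gap is a plain subtraction once the previous visit has been found.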
Answer 1: (score: 0)
This uses SQLite via sqldf, as in the question.
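We first define three tables in the with clause (they exist only for the duration of the SQL statement):
- aa is table a with an extra Julian-date column suitable for differencing
- tab_days is a table that uses aa to define the difference in days, via a suitably aggregated join
- tab_sum_spend is a table holding the sum of Spend
Finally, we join the last two and sort appropriately.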
library(sqldf)
# see note at end for a in reproducible form
t1 <- sqldf("
WITH aa AS (SELECT julianday(substr(Date, 1, 4) || '-' ||
substr(Date, 5, 2) || '-' ||
substr(Date, 7, 2)) juldate,
*
FROM a),
tab_days AS (SELECT a1.Date, min(a1.juldate - a2.juldate) Days, a1.Cust, a1.Basket
FROM aa a1
LEFT JOIN aa a2 ON a1.Date > a2.Date AND a1.Cust = a2.Cust
GROUP BY a1.Cust, a1.Date, a1.Basket),
tab_sum_spend AS (SELECT Cust, Date, Basket, sum(Spend) Spend
FROM aa
GROUP BY Cust, Date, Basket)
SELECT Days, Cust, Basket, Spend
FROM tab_days
JOIN tab_sum_spend USING(Cust, Date, Basket)
ORDER BY Cust, Date, Basket
")
t1
## Days Cust Basket Spend
## 1 <NA> C1 B1 80
## 2 92.0 C1 B2 30
## 3 190.0 C1 B3 60
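The aa step hinges on rebuilding an ISO-style date (2016-04-16) from the YYYYMMDD value so that julianday() can convert it. As a small sketch of just that step, run here through Python's sqlite3 module rather than sqldf (an assumption of this illustration; the helper name days_between is hypothetical):

```python
import sqlite3

# Rebuild YYYY-MM-DD from a YYYYMMDD string, then convert it to a Julian day.
TO_JULIAN = ("julianday(substr(?, 1, 4) || '-' || "
             "substr(?, 5, 2) || '-' || substr(?, 7, 2))")

def days_between(later, earlier):
    """Gap in days between two YYYYMMDD strings, computed by SQLite."""
    con = sqlite3.connect(":memory:")
    sql = "SELECT " + TO_JULIAN + " - " + TO_JULIAN
    # Each date string is bound three times, once per substr() call.
    (gap,) = con.execute(sql, (later,) * 3 + (earlier,) * 3).fetchone()
    con.close()
    return gap

if __name__ == "__main__":
    print(days_between("20160416", "20160115"))
    print(days_between("20161023", "20160416"))
```

For the question's data the two calls correspond to the B1 to B2 and B2 to B3 gaps in the Days column.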
And for the second question:
sqldf("SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend FROM t1 GROUP BY Cust")
## AvgDays Cust AvgSpend
## 1 141 C1 56.66667
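Averaging per customer relies on grouping by Cust, which matters once the data holds more than one customer. A minimal sketch, assuming the three per-visit rows of t1 shown above and using Python's sqlite3 module in place of sqldf (the helper name customer_averages is illustrative only):

```python
import sqlite3

# Per-visit rows as printed for t1 above: (Days, Cust, Basket, Spend).
T1 = [
    (None, "C1", "B1", 80),
    (92.0, "C1", "B2", 30),
    (190.0, "C1", "B3", 60),
]

def customer_averages(rows):
    """Return one (AvgDays, Cust, AvgSpend) tuple per customer."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (Days REAL, Cust TEXT, Basket TEXT, Spend INTEGER)")
    con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", rows)
    # avg() ignores NULLs, so the first visit (no previous date) is skipped
    # when averaging Days but still counts toward the average Spend.
    out = con.execute(
        "SELECT avg(Days) AvgDays, Cust, avg(Spend) AvgSpend "
        "FROM t1 GROUP BY Cust ORDER BY Cust"
    ).fetchall()
    con.close()
    return out

if __name__ == "__main__":
    print(customer_averages(T1))
```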
Note: the data.frame a in reproducible form is:
Lines <- "Date Hour Cust Prod Basket Spend
1 20160416 8 C1 P1 B2 10
2 20160416 8 C1 P2 B2 20
3 20160115 15 C1 P3 B1 30
4 20160115 15 C1 P2 B1 50
5 20161023 11 C1 P4 B3 60"
a <- read.table(text = Lines, as.is = TRUE)