Distinct counts give different results over different time periods

Time: 2018-10-13 18:15:49

Tags: sql apache-spark apache-spark-sql bigdata qubole

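I am trying to count unique visitors. I first checked the overall total, without splitting it by any time frame.

Main table (a sample of the big-data table):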

+-----------+----+-------+
|theDateTime|vD  | vis   |
+-----------+----+-------+
|2018-10-03 |123 |abc    |
|2018-10-04 |123 |abc    |
|2018-10-04 |123 |pqr    |
|2018-10-05 |123 |xyz    |
+-----------+----+-------+

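The total distinct count for the data above is 3, but when I group by day, abc is counted twice: first on the 3rd and then on the 4th. I only want to count the first occurrence.

My query for the total: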

SELECT
  d.eId AS vD
  , COUNT(DISTINCT visitorId) AS vis
FROM decisions d
WHERE d.eId = 123
  AND timestamp BETWEEN unix_timestamp('2018-10-03 00:00:00')*1000
                    AND unix_timestamp('2018-10-06 12:17:00')*1000
GROUP BY d.eId
ORDER BY vD

My result:

+----+---------+
| vD | vis     |
+----+---------+
|123 | 3       |
+----+---------+

My per-day query:

SELECT DISTINCT
  cast(from_unixtime(timestamp DIV 1000) AS date) AS theDateTime
  , d.eId AS vD
  , COUNT(DISTINCT visitorId) AS vis
FROM decisions d
WHERE timestamp BETWEEN unix_timestamp('2018-10-03 00:00:00')*1000
                    AND unix_timestamp('2018-10-06 12:17:00')*1000
  AND d.eId IN (11550123588)
GROUP BY cast(from_unixtime(timestamp DIV 1000) AS date), d.eId
ORDER BY vD, theDateTime

My result:

+-----------+----+-------+
|theDateTime|vD  | vis   |
+-----------+----+-------+
|2018-10-03 |123 |   1   |
|2018-10-04 |123 |   2   |
|2018-10-05 |123 |   1   |
+-----------+----+-------+

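The daily counts sum to 1122585, which is greater than the overall total.

I know this happens because a visitor who shows up on more than one day is counted once per day when I group by day. Is there a way to skip a visitor on later days if they have already been counted on their first day?

Please help!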

2 Answers:

Answer 0: (score: 0)

If I understand this correctly, you just need a different view of the data.

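// Assumes spark-shell, where spark.implicits._ (toDF, $"...") is already in scope.
import org.apache.spark.sql.functions._

// Sample rows: (date, vD, visitor id)
val df = Seq(("2018-10-03",123,"abc"),
("2018-10-04",123,"abc"),
("2018-10-05",123,"pqr"),
("2018-10-06",123,"xyz")).toDF("theDateTime","vD","vis").withColumn("theDateTime", $"theDateTime".cast("timestamp"))

df.show

// For each visitor keep the earliest timestamp; pivot creates one column per vD value.
val df1 = df.groupBy("vis").pivot("vD").agg(min("theDateTime")).sort($"123")
df1.show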

+---+-------------------+
|vis|                123|
+---+-------------------+
|abc|2018-10-03 00:00:00|
|pqr|2018-10-05 00:00:00|
|xyz|2018-10-06 00:00:00|
+---+-------------------+

Now, if you group by the "123" column, you get the unique count per day. Does this help?
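The grouping step described above could be sketched roughly as follows. This is not the answerer's code: it assumes a local SparkSession, reuses the sample rows from this answer, and swaps the pivot for a plain `groupBy` on `vD` and `vis` so that the first-seen column gets an ordinary name. The names `FirstSeenCounts`, `perDay` and `firstSeen` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, min}

object FirstSeenCounts {
  // Hypothetical helper: attributes each visitor to the first day it appears.
  def perDay(df: DataFrame): DataFrame = {
    // Step 1: earliest date per (vD, visitor).
    val firstSeen = df
      .groupBy(col("vD"), col("vis"))
      .agg(min(col("theDateTime")).as("firstSeen"))

    // Step 2: count visitors by the day they were first seen.
    firstSeen
      .groupBy(col("vD"), col("firstSeen"))
      .agg(count(col("vis")).as("vis"))
      .orderBy(col("vD"), col("firstSeen"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy data set.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-seen-counts")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as in the answer above.
    val df = Seq(
      ("2018-10-03", 123, "abc"),
      ("2018-10-04", 123, "abc"),
      ("2018-10-05", 123, "pqr"),
      ("2018-10-06", 123, "xyz")
    ).toDF("theDateTime", "vD", "vis")
      .withColumn("theDateTime", col("theDateTime").cast("date"))

    perDay(df).show()
    spark.stop()
  }
}
```

With these rows every visitor lands on exactly one day, so the per-day counts add up to the number of distinct visitors (3) instead of double counting abc.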

Answer 1: (score: 0)

If I understand correctly, you can do this in SQL with a subquery:

select min_dt, count(distinct vis) AS vis
from (select eid, vis, min(thedatetime) as min_dt
      from decisions d
      where d.eid = 123 and . . .
      group by vis, eid
     ) d
group by min_dt
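To see the subquery approach end to end, here is a minimal sketch that runs a query of the same shape through `spark.sql` on the four sample rows from the question. Several things are assumptions made for illustration only: the temp view name `decisions_sample`, the object and method names, the output aliases `first_day` and `new_visitors`, and the use of the sample table's column names (`theDateTime`, `vD`, `vis`) instead of the real table's columns. The elided filter in the answer is not reconstructed; only the `vD` filter is applied.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FirstVisitSql {
  // Hypothetical helper: counts each visitor only on the first day it appears.
  def firstVisitCounts(spark: SparkSession, view: String, eId: Int): DataFrame =
    spark.sql(
      s"""
         |SELECT min_dt AS first_day, vD, COUNT(*) AS new_visitors
         |FROM (
         |  SELECT vD, vis, MIN(theDateTime) AS min_dt
         |  FROM $view
         |  WHERE vD = $eId
         |  GROUP BY vD, vis
         |) first_seen
         |GROUP BY min_dt, vD
         |ORDER BY vD, first_day
         |""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy data set.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-visit-sql")
      .getOrCreate()
    import spark.implicits._

    // Sample rows from the question's main table.
    val df = Seq(
      ("2018-10-03", 123, "abc"),
      ("2018-10-04", 123, "abc"),
      ("2018-10-04", 123, "pqr"),
      ("2018-10-05", 123, "xyz")
    ).toDF("theDateTime", "vD", "vis")
      .withColumn("theDateTime", $"theDateTime".cast("date"))

    df.createOrReplaceTempView("decisions_sample")
    firstVisitCounts(spark, "decisions_sample", 123).show()
    spark.stop()
  }
}
```

Because the inner query already yields one row per visitor, the outer `COUNT(*)` counts each visitor once. On the sample rows abc is attributed to 2018-10-03 only, so the three daily counts sum to 3, matching the overall distinct count.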