How do I stop data duplication when manipulating data with BigQuery SAFE_DIVIDE logic?

Time: 2019-07-02 20:19:31

Tags: sql google-cloud-platform google-bigquery

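My problem is that after adding some logic (SAFE_DIVIDE) to a BigQuery #standardSQL statement, I started receiving duplicate data. The problem only appears after I add this line: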

SAFE_DIVIDE( u.weekly_capacity/25200, 1) AS TargetDailyHours

If I can't solve this, I may just have to write all the logic in Data Studio, since the current workflow is Harvest -> Stitch -> BigQuery -> Data Studio.

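In this query I take the most recent version of each time entry (MAX(updated_at)) from the time_entries table, LEFT JOIN it back to time_entries, and then FULL JOIN the result to the users table restricted to currently active users. I want to actually manipulate the data so that I can compute an FTE's actual hours worked / weekly_capacity. But whenever I add logic or a BigQuery function, the results are duplicated?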


SELECT DISTINCT outer_e.hours, outer_e.id, outer_e.updated_at, 
                outer_e.spent_date, outer_e.created_at, 
                outer_e.client_id, outer_e.user_id AS harvest_userid,
                u.is_admin, u.first_name, u.is_active, u.id AS user_id, 
                u.weekly_capacity,
                client.name as names,

--SAFE_DIVIDE( u.weekly_capacity /25200, 1) AS TargetDailyHours

FROM
  (SELECT  e.id, MAX(e.updated_at) AS updated_at FROM `harvest-experiment.harvest.time_entries` AS e   
  GROUP BY e.id LIMIT 1000
  ) AS inner_e

LEFT JOIN `harvest-experiment.harvest.time_entries` AS outer_e
ON inner_e.id = outer_e.id AND inner_e.updated_at = outer_e.updated_at
FULL JOIN ( SELECT DISTINCT id, first_name, weekly_capacity, is_active, is_admin FROM `harvest-experiment.harvest.users` WHERE is_active = true
) AS u
ON outer_e.user_id = u.id  

JOIN (SELECT DISTINCT id , 
         name FROM `harvest-experiment.harvest.clients`) AS client
ON outer_e.client_id = client.id 



In the results, the weekly_capacity column starts showing, for example, the same person with different weekly capacity numbers:

Column            Row 1                    Row 2
hours             0.22                     0.22
id                995005338                995005338
updated_at        2019-05-07 15:14:13 UTC  2019-05-07 15:14:13 UTC
spent_date        2019-04-29 00:00:00 UTC  2019-04-29 00:00:00 UTC
created_at        2019-04-29 15:30:40 UTC  2019-04-29 15:30:40 UTC
client_id         6864491                  6864491
harvest_userid    2622223                  2622223
is_admin          false                    false
first_name        Nolan                    Nolan
is_active         true                     true
user_id           2622223                  2622223
weekly_capacity   72000                    129600
TargetDailyHours  2.857142857142857        5.142857142857143


In this result, user Nolan's time entry with id 995005338 and 0.22 hours is shown twice, and weekly_capacity changes from 129600 in row 2 to 72000 in row 1.

1 answer:

Answer 0: (score: 0)

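The actual problem lies in the u.weekly_capacity column, which holds two or more different values for the same user. The SAFE_DIVIDE operation merely reflects this issue.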
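As a small illustration that the division itself cannot add rows, the sketch below runs the expression from the question against the two weekly_capacity values reported for Nolan. The inline UNNEST array stands in for the real table and is only an assumption for demonstration; the `simplified` column shows an equivalent form without the redundant division by 1.

```sql
-- Two input values in, two rows out: SAFE_DIVIDE is evaluated per row
-- and never changes the row count.
SELECT
  weekly_capacity,
  SAFE_DIVIDE(weekly_capacity / 25200, 1) AS as_written,   -- expression from the question
  SAFE_DIVIDE(weekly_capacity, 25200)     AS simplified    -- same value, no "/ 1"
FROM UNNEST([72000, 129600]) AS weekly_capacity;
```

Both computed columns should agree for each input, which suggests the second row in the question's output comes from the join input rather than from the arithmetic.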

You can trace the duplicated value back to the "u" subquery:

SELECT DISTINCT id, first_name, weekly_capacity, is_active, is_admin 
    FROM `harvest-experiment.harvest.users`
    WHERE is_active = true
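One way to check this against your own data is a query along the following lines, which lists the active user ids that carry more than one distinct weekly_capacity. It uses only the table and columns already shown in the question; the alias capacity_versions is a name made up for this sketch.

```sql
-- List active users that have more than one distinct weekly_capacity value.
SELECT
  id,
  COUNT(DISTINCT weekly_capacity) AS capacity_versions,  -- hypothetical alias
  MIN(weekly_capacity) AS min_capacity,
  MAX(weekly_capacity) AS max_capacity
FROM `harvest-experiment.harvest.users`
WHERE is_active = true
GROUP BY id
HAVING COUNT(DISTINCT weekly_capacity) > 1;
```

If this returns any rows, each of those users will fan out into several rows when joined on id.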

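The users table contains two or more rows with the same id where is_active = true. This appears to be a data issue, so to avoid duplicate rows you have to decide which row's value to keep. For example, if you only want to keep the maximum value, you can use GROUP BY: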

SELECT id, first_name, MAX(weekly_capacity) as weekly_capacity, is_active, is_admin
    FROM `harvest-experiment.harvest.users`
    WHERE is_active = true
    GROUP BY id, first_name, is_active, is_admin

Alternatively, if your users table carries enough information, you can use additional columns to narrow the result further.

For example:

...
LEFT JOIN `harvest-experiment.harvest.time_entries` AS outer_e
    ON inner_e.id = outer_e.id AND inner_e.updated_at = outer_e.updated_at
FULL JOIN ( 
    SELECT DISTINCT id, first_name, weekly_capacity, is_active, is_admin, last_updated
        FROM `harvest-experiment.harvest.users` WHERE is_active = true
) AS u
ON outer_e.user_id = u.id AND outer_e.updated_at = u.last_updated
...
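Another option, sketched here under the same assumption that the users table has a last_updated timestamp column (that column name comes from the example above and may not match your actual schema), is to keep only the most recent row per user before joining. This avoids requiring the time entry and the user row to share the exact same timestamp.

```sql
-- Keep one row per user id: the most recently updated one.
-- ASSUMPTION: `last_updated` exists on the users table; substitute the
-- real timestamp column if it is named differently.
SELECT id, first_name, weekly_capacity, is_active, is_admin
FROM (
  SELECT
    id, first_name, weekly_capacity, is_active, is_admin,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS rn
  FROM `harvest-experiment.harvest.users`
)
WHERE rn = 1
  AND is_active = true;
```

Used in place of the "u" subquery, this would yield at most one row per user. Note that the is_active filter is applied after the latest row is picked, so a user whose most recent row is inactive is dropped entirely; move the filter into the inner query if you would rather keep the latest active row.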