Cross join in query results

Time: 2019-02-06 13:08:11

Tags: sql amazon-redshift

  

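Mapping: a single address ID can have several tracking IDs. Each tracking ID and each address ID has its own latitude/longitude pair. A tracking ID can have multiple route IDs, although in most cases a tracking ID maps to a single route ID.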

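Update: the tracking IDs I select in T1_2 may or may not exist in the other tables. Also, none of the temp tables used in the final select statement contains duplicates (based on key values).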

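I have a problem with the results of the query below. The query is supposed to produce a metric for how far the delivery point deviates from the address. It is effectively doing a cross join on some columns, so it returns more rows than it should. I know this is a granularity issue and a basic mistake, but I am having a hard time finding where it goes wrong. If anyone can give me some pointers, please do. A subset of the results is attached as a link, and I have highlighted an example tracking ID that should appear only once (with a single route ID). The results should contain address IDs that repeat multiple times, each with non-repeating tracking_id values, and the number of those tracking IDs should in turn match the no_pkg column. The query is included for reference. Results subset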

CREATE OR REPLACE FUNCTION f_stop_distance (Float, Float, Float, Float) /* This calculates distance in meters between two sets of lat and long */
       RETURNS FLOAT
     IMMUTABLE
     AS $$
       SELECT
          2 * 6373000 * ASIN( SQRT( ( SIN( RADIANS(($3 - $1) / 2) ) ) ^ 2 + COS(RADIANS($1)) * COS(RADIANS($3)) * (SIN(RADIANS(($4 - $2) / 2))) ^ 2))
     $$ LANGUAGE sql
    ;

    CREATE TEMPORARY TABLE T1 AS /* This is to get top 1000 address ids which are unique identifiers for addresses in terms of orders frequency which is decided by number of distinct ordering order ids */
    SELECT destination_address_id 
    ,COUNT(DISTINCT ordering_order_id)a
    ,COUNT(DISTINCT tracking_id) no_pkg
    FROM lmaa_pm.perfectmile_onroad_events_na
    where shipment_status = 'DELIVERED'              
    AND delivery_station_code = 'DCH1'
    AND event_day BETWEEN '2018-12-01' AND '2018-12-31'
    AND tracking_id IS NOT NULL
    GROUP BY destination_address_id,delivery_station_code
    ORDER BY a DESC
    LIMIT 1000
    ;

    CREATE TEMPORARY TABLE T1_2 AS /* This is to get tracking ids corresponding to those top 1000 address ids */
    SELECT DISTINCT destination_address_id
    ,tracking_id
    FROM lmaa_pm.perfectmile_onroad_events_na
    WHERE destination_address_id IN (SELECT destination_address_id FROM T1) 
    AND event_day BETWEEN '2018-12-01' AND '2018-12-31'
    AND shipment_status = 'DELIVERED'
    AND delivery_station_code = 'DCH1'
    AND tracking_id IS NOT NULL
    GROUP BY 1,2
    ;



    CREATE TEMPORARY TABLE T2 AS /* This is to get lat long pairs for addresses and delivery point respectively */
    SELECT DISTINCT gdd.lat1
    ,gdd.long1
    ,gdd.external_address_id destination_address_id
    ,gdd.tracking_id
    ,gdd.actual_lat
    ,gdd.actual_long
    ,ROW_NUMBER() OVER(PARTITION BY tracking_id ORDER BY deliverydate DESC) rn /* This is to avoid duplicates since this table contains duplicates */
    FROM gtech.geocoding_data_daily_na gdd
    WHERE gdd.shipment_status_id in (51,'DELIVERED')
    AND tracking_id IN(SELECT tracking_id FROM T1_2)
    AND confidence1 = 'high'
    AND gdd.station_code='DCH1'
    AND deliverydate BETWEEN '2018-12-01' AND '2018-12-31'
    AND actual_lat IS NOT NULL
    AND actual_long IS NOT NULL
    ;

    CREATE TEMPORARY TABLE T2_2 AS
    SELECT *
    FROM T2
    WHERE rn = 1
    ;


    CREATE TEMPORARY TABLE T3 AS 
    SELECT T2_2.lat1
    ,T2_2.long1
    ,T2_2.actual_lat
    ,T2_2.actual_long
    ,T2_2.tracking_id
    ,T2_2.destination_address_id
    ,CASE /* This function is for identifying distance deviations in the order of 0 - 10 metres, 10-20 metres and so on */
    WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) <=10 THEN '0_to_10'
    WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) >10 
    and  f_stop_distance(lat1,long1,actual_lat,actual_long) <=20 THEN '10_to_20'
    WHEN f_stop_distance(lat1,long1,actual_lat,actual_long)>20
    and f_stop_distance(lat1,long1,actual_lat,actual_long) <=50 THEN '20_to_50'
    WHEN f_stop_distance(lat1,long1,actual_lat,actual_long) >50 THEN 'gt_50'
    END AS Dev_from_address
    FROM T2_2
    ORDER BY T2_2.tracking_id
    ;

    CREATE TEMPORARY TABLE T4 AS /* Doing some percentage calculations based on the new buckets created in the previous temp table namely percentage calculations out of total  */
    SELECT SUM(CASE WHEN Dev_from_address = '0_to_10' THEN 1 ELSE 0 END)a
          ,SUM(CASE WHEN Dev_from_address = '10_to_20' THEN 1 ELSE 0 END)b
          ,SUM(CASE WHEN Dev_from_address = '20_to_50' THEN 1 ELSE 0 END)c
          ,SUM(CASE WHEN Dev_from_address = 'gt_50' THEN 1 ELSE 0 END)d
          ,tracking_id
          ,(a/(a+b+c+d)::DECIMAL(10,2) * 100) AS e
          ,(b/(a+b+c+d)::DECIMAL(10,2) * 100) AS f
          ,(c/(a+b+c+d)::DECIMAL(10,2) * 100) AS g
          ,(d/(a+b+c+d)::DECIMAL(10,2) * 100) AS h
    FROM T3
    GROUP BY tracking_id
    ;
    CREATE TEMPORARY TABLE T5 AS /* adding info for route id to the existing data */
    SELECT DISTINCT route_id
    ,tracking_id
    ,ROW_NUMBER() OVER (PARTITION BY tracking_id ORDER BY DATE DESC) rnnn /* to avoid duplicates */
    FROM omw.route_actuals_na
    WHERE tracking_id IN (SELECT tracking_id FROM T1_2)
    AND stop_type = 'Dropoff'
    AND scan_status  = 'DELIVERED'
    ;

    CREATE TEMPORARY TABLE T5_final AS
    SELECT *
    FROM T5
    WHERE rnnn = 1
    ;

    /* final select */
    SELECT  DISTINCT T1_2.destination_address_id
    ,T3.lat1
    ,T3.long1
    ,T3.actual_lat
    ,T3.actual_long
    ,T3.Dev_from_address
    ,T1_2.tracking_id
    ,T1.no_pkg 
    ,T4.e
    ,T4.f
    ,T4.g
    ,T4.h
    ,T5_final.route_id
    FROM T3
    JOIN T4 ON T4.tracking_id = T3.tracking_id
    JOIN T1 ON T1.destination_address_id = T3.destination_address_id
    JOIN T1_2 ON T1_2.destination_address_id = T3.destination_address_id
    JOIN T5_final ON T5_final.tracking_id = T3.tracking_id
    ORDER BY T1_2.destination_address_id

1 answer:

Answer 0: (score: 0)

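Strictly speaking, there is no full cross join there, but you probably have a many-to-many join. To track it down, look at each join and check whether any key value occurs more than once: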

select tracking_id,count(*) from t4 group by 1 having count(*) > 1;
select destination_address_id,count(*) from t1 group by 1 having count(*) > 1;
select tracking_id ,count(*) from t5_final group by 1 having count(*) > 1;

Any values these queries return are likely your cause. This should help you pin down where the many-to-many join is.
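The three checks above do not cover T1_2, which the final select joins on destination_address_id alone. Since the question says one address ID can have several tracking IDs, that key is expected to repeat in T1_2. Below is a minimal sketch of the effect on toy data. The demo_ table names and the values A1, TRK1 and so on are hypothetical, and joining on both columns is only one possible fix, not something the asker confirmed.

```sql
-- Hypothetical toy data: one address (A1) with three tracking IDs,
-- mirroring the shape of T1_2 (address, tracking id) and T3 (one row per tracking id).
create temporary table demo_t1_2 (
    destination_address_id varchar(10),
    tracking_id            varchar(10)
);
insert into demo_t1_2 values ('A1', 'TRK1'), ('A1', 'TRK2'), ('A1', 'TRK3');

create temporary table demo_t3 (
    destination_address_id varchar(10),
    tracking_id            varchar(10)
);
insert into demo_t3 values ('A1', 'TRK1'), ('A1', 'TRK2'), ('A1', 'TRK3');

-- Join on the address only, as the final select does:
-- each of the 3 demo_t3 rows matches all 3 demo_t1_2 rows, giving 9 rows.
select count(*) as rows_address_only
from demo_t3 t3
join demo_t1_2 k
  on k.destination_address_id = t3.destination_address_id;

-- Join on address and tracking id:
-- each demo_t3 row matches exactly one demo_t1_2 row, giving 3 rows.
select count(*) as rows_address_and_tracking
from demo_t3 t3
join demo_t1_2 k
  on k.destination_address_id = t3.destination_address_id
 and k.tracking_id = t3.tracking_id;
```

Against the real temp tables, the equivalent check would be `select destination_address_id, count(*) from t1_2 group by 1 having count(*) > 1;`. If it returns rows, the join to T1_2 is a likely source of the extra rows.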