Unable to get a query working in Hive

Time: 2017-11-09 16:32:53

Tags: hadoop join hive case

I have the following data:

SELECT 
    mtrans.merch_num,
    mtrans.card_num 
FROM a_sbp_db.merch_trans_daily mtrans 
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num 
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id
WHERE mtrans.transaction_date LIKE '2017-09%' AND person_org_code='P' AND ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 30;



+-----------+----------------------------+
| merch_num | card_num                   |
+-----------+----------------------------+
|         1 | 4658XXXXXXXXXXXXXXXXXXURMX |
|         2 | 4658XXXXXXXXXXXXXXXXXXIE6X |
|         2 | 4658XXXXXXXXXXXXXXXXXXDA8X |
|         2 | 4658XXXXXXXXXXXXXXXXXX7D1X |
|         2 | 4658XXXXXXXXXXXXXXXXXXTJ2X |
|         2 | 4658XXXXXXXXXXXXXXXXXXQQWX |
|         2 | 4659XXXXXXXXXXXXXXXXXXY4EX |
|         2 | 4658XXXXXXXXXXXXXXXXXXRDOX |
|         2 | 4658XXXXXXXXXXXXXXXXXX0O3X |
|         2 | 4658XXXXXXXXXXXXXXXXXXNVBX |
+-----------+----------------------------+

I want to aggregate trans_amt by merch_num, but only for merchants that have more than one distinct card_num.

With a simple query, I can do it like this:

SELECT 
    mtrans.merch_num,
    FROM_UNIXTIME(UNIX_TIMESTAMP(),'MMM-yyyy') AS process_month,
    SUM(mtrans.trans_amt) AS total_age_less_30_1 
FROM a_sbp_db.merch_trans_daily mtrans 
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num 
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id
WHERE mtrans.transaction_date LIKE '2017-09%' AND person_org_code='P' AND  ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 30 
GROUP BY 
    mtrans.merch_num having count(distinct mtrans.card_num) > 1;

+-----------+---------------+---------------------+
| merch_num | process_month | total_age_less_30_1 |
+-----------+---------------+---------------------+
|         2 | Nov-2017      | 2147.5              |
+-----------+---------------+---------------------+

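Here merchant 5493036 (shown as merch_num 1 in the sample above) is skipped, because it does not have more than one distinct card.

But I have several such conditions and want to write just one query. Using CASE expressions I can do it like this: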

SELECT mtrans.merch_num,
    FROM_UNIXTIME(UNIX_TIMESTAMP(),'MMM-yyyy') AS process_month,
    NVL(SUM(CASE
        WHEN (ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 30)
            THEN mtrans.trans_amt ELSE 0 END), NULL)
            AS total_age_less_30_1,
    NVL(SUM(CASE
        WHEN (ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) >= 30
                    AND ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 40)
            THEN mtrans.trans_amt ELSE 0 END), NULL)
            AS total_age_30_40_1
FROM a_sbp_db.merch_trans_daily mtrans
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id   
WHERE mtrans.transaction_date LIKE '2017-09%'
    AND person_org_code='P'
GROUP BY
    mtrans.merch_num

+-----------+---------------+---------------------+-------------------+
| merch_num | process_month | total_age_less_30_1 | total_age_30_40_1 |
+-----------+---------------+---------------------+-------------------+
|       3   | Nov-2017      | 0                   | 0                 |
|       4   | Nov-2017      | 0                   | 0                 |
|       1   | Nov-2017      | 2.49                | 203.68            |
|       2   | Nov-2017      | 2147.5              | 4907              |
|       5   | Nov-2017      | 0                   | 0                 |
+-----------+---------------+---------------------+-------------------+

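I want 2.49 to be NULL for that merchant, because it does not have more than one distinct card.

I am not able to apply the condition that SUM(trans_amt) should be shown only when the number of distinct card numbers is greater than 1.

When I add that as an AND condition inside the CASE expression, I get the following error: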

SELECT 
    mtrans.merch_num,
    FROM_UNIXTIME(UNIX_TIMESTAMP(),'MMM-yyyy') AS process_month,
    NVL(SUM(CASE
        WHEN (ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 30 and count(distinct mtrans.card_num) > 1) 
            THEN mtrans.trans_amt ELSE 0 END), NULL)
            AS total_age_less_30_1,
    NVL(SUM(CASE
        WHEN (ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) >= 30
                    AND     ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 40 and count(distinct mtrans.card_num) > 1)
            THEN mtrans.trans_amt ELSE 0 END), NULL)
            AS total_age_30_40_1                
FROM a_sbp_db.merch_trans_daily mtrans 
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num 
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id
WHERE mtrans.transaction_date LIKE '2017-09%' 
    AND person_org_code='P' 
GROUP BY 
    mtrans.merch_num;


ERROR: AnalysisException: aggregate function must not contain aggregate parameters: sum(CASE WHEN (round(datediff(mtrans.transaction_date, cdemo.date_birth) / 365) < 30 AND count(DISTINCT mtrans.card_num) > 1) THEN mtrans.trans_amt ELSE 0 END)

Can anyone help?

3 answers:

Answer 0 (score: 0)

The error occurs because COUNT is nested inside SUM. Keep the two aggregates apart: test COUNT(DISTINCT ...) in an outer CASE and leave only the age condition inside SUM. With no ELSE branch, the outer CASE returns NULL for merchants that fail the card check. Note that this counts distinct cards across all age groups of a merchant. Try this and let me know how it goes:

SELECT 
    mtrans.merch_num,
    FROM_UNIXTIME(UNIX_TIMESTAMP(),'MMM-yyyy') AS process_month,
    CASE
        WHEN COUNT(DISTINCT mtrans.card_num) > 1
            THEN SUM(CASE
                WHEN ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 30
                    THEN mtrans.trans_amt ELSE 0 END)
        END AS total_age_less_30_1,
    CASE
        WHEN COUNT(DISTINCT mtrans.card_num) > 1
            THEN SUM(CASE
                WHEN ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) >= 30
                    AND ROUND(DATEDIFF(mtrans.transaction_date,cdemo.date_birth)/365) < 40
                    THEN mtrans.trans_amt ELSE 0 END)
        END AS total_age_30_40_1
FROM a_sbp_db.merch_trans_daily mtrans 
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num 
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id
WHERE mtrans.transaction_date LIKE '2017-09%' 
    AND person_org_code='P' 
GROUP BY 
    mtrans.merch_num;
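
The pattern of keeping COUNT(DISTINCT ...) and SUM as sibling aggregates can be tried without a Hive cluster. The following is a minimal sketch that runs a query of the same shape against an in-memory SQLite database from Python. The table `trans`, its columns, and the sample rows are made up for illustration, and the age is stored directly instead of being derived with DATEDIFF. It also counts distinct cards per age bucket, which is only an assumption about what the question intends; an unconditional COUNT(DISTINCT card_num) is written the same way.

```python
import sqlite3

# Hypothetical toy data: (merch_num, card_num, age, trans_amt).
rows = [
    (1, "A", 25, 2.49),     # merchant 1: a single card under 30
    (1, "B", 35, 100.00),   # merchant 1: two cards in the 30-40 bucket
    (1, "C", 36, 103.68),
    (2, "D", 22, 1000.00),  # merchant 2: two cards under 30
    (2, "E", 28, 1147.50),
    (2, "F", 33, 4907.00),  # merchant 2: a single card in the 30-40 bucket
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trans (merch_num INTEGER, card_num TEXT, age INTEGER, trans_amt REAL)"
)
conn.executemany("INSERT INTO trans VALUES (?, ?, ?, ?)", rows)

# COUNT and SUM sit side by side at the same aggregation level; neither is
# nested inside the other. A CASE without an ELSE branch yields NULL.
query = """
SELECT merch_num,
       CASE WHEN COUNT(DISTINCT CASE WHEN age < 30 THEN card_num END) > 1
            THEN SUM(CASE WHEN age < 30 THEN trans_amt END)
       END AS total_age_less_30_1,
       CASE WHEN COUNT(DISTINCT CASE WHEN age >= 30 AND age < 40 THEN card_num END) > 1
            THEN SUM(CASE WHEN age >= 30 AND age < 40 THEN trans_amt END)
       END AS total_age_30_40_1
FROM trans
GROUP BY merch_num
ORDER BY merch_num
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In this toy data, merchant 1 gets NULL (None in Python) in the under-30 column and merchant 2 gets NULL in the 30-40 column, because each has only one distinct card in that bucket.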

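Answer 1 (score: 0)

I suggest a cleaner way of doing this, as follows.

  

(PS: I don't have access to Hive, so I did this in plain SQL on PostgreSQL. It should be fairly easy to adapt to Hive SQL.)

Here are the SQL table and the records I inserted into it.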

CREATE TEMPORARY TABLE hivetest (
   merchant_id INTEGER,
   card_number TEXT,
   customer_dob TIMESTAMP,
   transaction_dt TIMESTAMP,
   transaction_amt DECIMAL
);

INSERT INTO hivetest VALUES
(1, 'A', '1997-12-01', '2017-11-01', 10.0),
(2, 'A', '1997-12-01', '2017-11-01', 11.0),
(2, 'B', '1980-12-01', '2017-11-01', 12.0),
(3, 'A', '1997-12-01', '2017-11-01', 13.0),
(3, 'A', '1997-12-01', '2017-11-01', 14.0),
(4, 'A', '1997-12-01', '2017-11-01', 15.0),
(4, 'C', '1980-12-01', '2017-11-01', 16.0);

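First, you need to join the tables and produce a dataset that gives you transaction_age (transaction_dt - customer_dob). I keep most of the data in this single table so that the date subtraction is easy to show, but a plain INNER JOIN should be enough to get the same columns from your tables. In any case, here is the query for that.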

SELECT
    merchant_id, card_number, DATE(customer_dob) customer_dob, DATE(transaction_dt) transaction_dt,
    DATE_PART('year', DATE(transaction_dt)) - DATE_PART('year', DATE(customer_dob)) transaction_age,
    transaction_amt
FROM hivetest ORDER BY 1;

This produces the following data.

+-------------+-------------+--------------+----------------+-----------------+----------------+
| merchant_id | card_number | customer_dob | transaction_dt | transaction_age |transaction_amt |
+-------------+-------------+--------------+----------------+-----------------+----------------+
|           1 |      A      | 1997-12-01   | 2017-11-01     |              20 |           10.0 |
|           2 |      A      | 1997-12-01   | 2017-11-01     |              20 |           11.0 |
|           2 |      B      | 1980-12-01   | 2017-11-01     |              37 |           12.0 |
|           3 |      A      | 1997-12-01   | 2017-11-01     |              20 |           13.0 |
|           3 |      A      | 1997-12-01   | 2017-11-01     |              20 |           14.0 |
|           4 |      A      | 1997-12-01   | 2017-11-01     |              20 |           15.0 |
|           4 |      C      | 1980-12-01   | 2017-11-01     |              37 |           16.0 |
+-------------+-------------+--------------+----------------+-----------------+----------------+
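
Note that the year subtraction used here is not the same age definition as the ROUND(DATEDIFF(...)/365) expression in the question. The sketch below, in plain Python and only for illustration, compares both with the number of completed years for the first sample row; the variable names are my own.

```python
from datetime import date

dob = date(1997, 12, 1)
txn = date(2017, 11, 1)

# Calendar-year subtraction, as in the PostgreSQL query above.
year_diff = txn.year - dob.year

# Day difference divided by 365 and rounded, as in the question's query.
days_over_365 = round((txn - dob).days / 365)

# Completed years: subtract one if the birthday has not occurred yet that year.
completed_years = year_diff - ((txn.month, txn.day) < (dob.month, dob.day))

print(year_diff, days_over_365, completed_years)
```

For this row both approximations give 20 while the customer has completed 19 years, so rows close to a bucket boundary may land in a different bucket depending on which definition is used.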

The dataset above lets you bucket the transaction amounts by transaction_age as needed. The trick is to use the query above as a subquery and do the bucketing on its result. Here is the query that does this.

SELECT
    merchant_id,
    -- Transaction age up to and including 30
    SUM(CASE WHEN transaction_age <= 30 THEN 1 ELSE 0 END) count_30,
    SUM(CASE WHEN transaction_age <= 30 THEN transaction_amt ELSE 0 END) sum_30,

    -- Transaction age above 30, up to and including 40
    SUM(CASE WHEN transaction_age > 30 AND transaction_age <= 40 THEN 1 ELSE 0 END) count_30_40,
    SUM(CASE WHEN transaction_age > 30 AND transaction_age <= 40 THEN transaction_amt ELSE 0 END) sum_30_40
FROM
(
    SELECT
        merchant_id, transaction_amt,
        DATE_PART('year', DATE(transaction_dt)) - DATE_PART('year', DATE(customer_dob)) transaction_age
    FROM hivetest
) m
GROUP BY merchant_id ORDER BY 1;

This produces the bucketed output below, which gives you the number of transactions and the sum of the transaction amounts per bucket for each merchant:

+-------------+----------+--------+-------------+-----------+
| merchant_id | count_30 | sum_30 | count_30_40 | sum_30_40 |
+-------------+----------+--------+-------------+-----------+
|           1 |        1 |   10.0 |           0 |         0 |
|           2 |        1 |   11.0 |           1 |      12.0 |
|           3 |        2 |   27.0 |           0 |         0 |
|           4 |        1 |   15.0 |           1 |      16.0 |
+-------------+----------+--------+-------------+-----------+

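Now, this dataset is more or less the final result. However, per your requirement, you are only interested in merchants that have more than one distinct card (COUNT(DISTINCT card_number) > 1).

So let's write another query that gives us this. The query below computes it and sets a flag to TRUE or FALSE according to that criterion, indicating whether we are interested in the merchant.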

SELECT
    merchant_id,
    CASE
        WHEN COUNT(DISTINCT card_number) > 1 THEN
            TRUE
        ELSE
            FALSE
    END has_distinct_cards_gt_1
FROM hivetest GROUP BY merchant_id ORDER BY 1

This gives the following output.

+-------------+-------------------------+
| merchant_id | has_distinct_cards_gt_1 |
+-------------+-------------------------+
|           1 |                   false |
|           2 |                   true  |
|           3 |                   false |
|           4 |                   true  |
+-------------+-------------------------+

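Now we are almost done. We just need to join these two result sets and, based on has_distinct_cards_gt_1, show the columns from the previously generated dataset accordingly.

Here is the final join query and the resulting data.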

SELECT
    merchants_all.merchant_id,

    -- Age <= 30
    CASE
        WHEN merchants_cards.has_distinct_cards_gt_1 THEN
            sum_30
        ELSE
            0
    END total_sum_30,

    -- Age above 30, up to and including 40
    CASE
        WHEN merchants_cards.has_distinct_cards_gt_1 THEN
            sum_30_40
        ELSE
            0
    END total_sum_30_40
FROM
  (
      SELECT
            merchant_id,
            SUM(CASE WHEN transaction_age <= 30 THEN transaction_amt ELSE 0 END) sum_30,
            SUM(CASE WHEN transaction_age > 30 AND transaction_age <= 40 THEN transaction_amt ELSE 0 END) sum_30_40
      FROM
      ( 
            SELECT merchant_id, DATE_PART('year', DATE(transaction_dt)) - DATE_PART('year', DATE(customer_dob)) transaction_age, transaction_amt
            FROM hivetest
      ) m
      GROUP BY merchant_id
  ) merchants_all
JOIN
  (
     SELECT merchant_id, CASE WHEN COUNT(DISTINCT card_number) > 1 THEN TRUE ELSE FALSE END has_distinct_cards_gt_1
     FROM hivetest GROUP BY merchant_id
  ) merchants_cards
ON
(merchants_all.merchant_id = merchants_cards.merchant_id);

This produces the final data you need.

+-------------+--------------+-----------------+
| merchant_id | total_sum_30 | total_sum_30_40 |
+-------------+--------------+-----------------+
|           1 |            0 |               0 |
|           2 |         11.0 |            12.0 |
|           3 |            0 |               0 |
|           4 |         15.0 |            16.0 |
+-------------+--------------+-----------------+
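
The question asks for NULL rather than 0 when a merchant does not have more than one distinct card. As a hedged sketch of that variant, the same join can be run on the sample rows in an in-memory SQLite database from Python. The SQLite date handling (strftime) and the aliases `a`, `c`, and `distinct_cards` are my own substitutions and would need adapting for Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hivetest (
        merchant_id INTEGER, card_number TEXT,
        customer_dob TEXT, transaction_dt TEXT, transaction_amt REAL
    )
""")
conn.executemany(
    "INSERT INTO hivetest VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", "1997-12-01", "2017-11-01", 10.0),
        (2, "A", "1997-12-01", "2017-11-01", 11.0),
        (2, "B", "1980-12-01", "2017-11-01", 12.0),
        (3, "A", "1997-12-01", "2017-11-01", 13.0),
        (3, "A", "1997-12-01", "2017-11-01", 14.0),
        (4, "A", "1997-12-01", "2017-11-01", 15.0),
        (4, "C", "1980-12-01", "2017-11-01", 16.0),
    ],
)

# A CASE without ELSE returns NULL, so merchants that fail the card check
# show NULL instead of 0.
query = """
SELECT a.merchant_id,
       CASE WHEN c.distinct_cards > 1 THEN a.sum_30 END AS total_sum_30,
       CASE WHEN c.distinct_cards > 1 THEN a.sum_30_40 END AS total_sum_30_40
FROM (
    SELECT merchant_id,
           SUM(CASE WHEN transaction_age <= 30 THEN transaction_amt ELSE 0 END) AS sum_30,
           SUM(CASE WHEN transaction_age > 30 AND transaction_age <= 40
                    THEN transaction_amt ELSE 0 END) AS sum_30_40
    FROM (
        SELECT merchant_id, transaction_amt,
               CAST(strftime('%Y', transaction_dt) AS INTEGER)
                 - CAST(strftime('%Y', customer_dob) AS INTEGER) AS transaction_age
        FROM hivetest
    ) m
    GROUP BY merchant_id
) a
JOIN (
    SELECT merchant_id, COUNT(DISTINCT card_number) AS distinct_cards
    FROM hivetest
    GROUP BY merchant_id
) c ON a.merchant_id = c.merchant_id
ORDER BY a.merchant_id
"""
final = conn.execute(query).fetchall()
for row in final:
    print(row)
```

Merchants 1 and 3 come back with None (NULL) in both total columns, while merchants 2 and 4 keep the sums shown in the table above.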

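Let me know if this helps.

Answer 2 (score: 0)

The COUNT inside the SUM is the problem. Here is a solution; I have not tested it. It is not obvious which table person_org_code belongs to. If it is in merch_trans_daily, add person_org_code = 'P' to the WHERE clause of the WITH query as well. Let us know whether it works!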

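WITH mtrans_count AS
(SELECT merch_num,
        COUNT(DISTINCT card_num) AS cnt  -- distinct cards per merchant, not a plain row count
   FROM a_sbp_db.merch_trans_daily
  WHERE transaction_date LIKE '2017-09%'  -- no table alias is defined inside the WITH query
  GROUP BY merch_num
)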
SELECT mtrans.merch_num
    ,FROM_UNIXTIME(UNIX_TIMESTAMP(), 'MMM-yyyy') AS process_month
    ,NVL(SUM(CASE 
                WHEN (
                        ROUND(DATEDIFF(mtrans.transaction_date, cdemo.date_birth) / 365) < 30
                        AND mtrans_count.cnt > 1
                        )
                    THEN mtrans.trans_amt
                ELSE 0
                END), NULL) AS total_age_less_30_1
    ,NVL(SUM(CASE 
                WHEN (
                        ROUND(DATEDIFF(mtrans.transaction_date, cdemo.date_birth) / 365) >= 30
                        AND ROUND(DATEDIFF(mtrans.transaction_date, cdemo.date_birth) / 365) < 40
                        AND mtrans_count.cnt > 1
                        )
                    THEN mtrans.trans_amt
                ELSE 0
                END), NULL) AS total_age_30_40_1
FROM a_sbp_db.merch_trans_daily mtrans
INNER JOIN a_sbp_db.product_holding ph ON mtrans.card_num = ph.acc_num
INNER JOIN a_sbp_db.cust_demo cdemo ON cdemo.cust_id = ph.cust_id
INNER JOIN mtrans_count  ON mtrans_count.merch_num = mtrans.merch_num
WHERE mtrans.transaction_date LIKE '2017-09%'
    AND person_org_code = 'P'
GROUP BY mtrans.merch_num;
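
To see the WITH-based approach in action without Hive, here is a small sketch against an in-memory SQLite database from Python. The table `trans`, the CTE name `card_count`, and the rows are hypothetical, and the age is stored directly instead of being computed. Merchant 1 has two transactions on the same card, which illustrates why the pre-aggregation has to count distinct cards per merchant rather than rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trans (merch_num INTEGER, card_num TEXT, age INTEGER, trans_amt REAL)"
)
conn.executemany(
    "INSERT INTO trans VALUES (?, ?, ?, ?)",
    [
        (1, "A", 25, 2.5),
        (1, "A", 25, 4.0),   # same card twice: still one distinct card
        (2, "B", 22, 10.0),
        (2, "C", 35, 20.0),
    ],
)

# The CTE computes the distinct-card count once per merchant; after the join
# cnt is an ordinary column, so it may appear inside SUM(CASE ...).
query = """
WITH card_count AS (
    SELECT merch_num, COUNT(DISTINCT card_num) AS cnt
    FROM trans
    GROUP BY merch_num
)
SELECT t.merch_num,
       SUM(CASE WHEN t.age < 30 AND cc.cnt > 1 THEN t.trans_amt END)
           AS total_age_less_30_1,
       SUM(CASE WHEN t.age >= 30 AND t.age < 40 AND cc.cnt > 1 THEN t.trans_amt END)
           AS total_age_30_40_1
FROM trans t
JOIN card_count cc ON cc.merch_num = t.merch_num
GROUP BY t.merch_num
ORDER BY t.merch_num
"""
cte_result = conn.execute(query).fetchall()
for row in cte_result:
    print(row)
```

Leaving out ELSE 0 makes SUM return NULL when no row qualifies, which matches the NULL the question asks for. It also means a merchant with several cards but no transactions in a bucket shows NULL rather than 0 for that bucket.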