Ranking columns in Hadoop

Time: 2012-12-07 22:58:45

Tags: hadoop hive hiveql

I am working with these specific columns: customer_token, merchant_id, merchant_category_code, transaction_amount

My current query is:

SELECT customer_token, COUNT(transaction_amount), SUM(transaction_amount)
  FROM transaction
 WHERE file_date > 20121031
   AND file_date < 20121201
 GROUP BY customer_token

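I want to add a part to the query above so that the result breaks merchant_category_code out into separate columns, based on the transaction amounts within each specific merchant_category_code. The result would look like this: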

  

customer_token, count(transaction_amount), sum(transaction_amount), count(transaction_amount for the merchant_category_code ranked 1), count(transaction_amount for the merchant_category_code ranked 2), count(transaction_amount for the merchant_category_code ranked 3), etc.

And then this:

  

customer_token, count(transaction_amount), sum(transaction_amount), sum(transaction_amount for the merchant_category_code ranked 1), sum(transaction_amount for the merchant_category_code ranked 2), sum(transaction_amount for the merchant_category_code ranked 3), etc.

But I am at a loss as to how to do this, or whether it is even possible.

1 answer:

Answer 0 (score: 2)

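If you know in advance what the possible values of merchant_category_code are, you can use CASE expressions. A CASE with no ELSE branch yields NULL for non-matching rows, and COUNT and SUM both skip NULLs, so each column aggregates only the rows of its own category: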

SELECT customer_token,
       COUNT(transaction_amount),
       SUM(transaction_amount),
       COUNT(CASE WHEN merchant_category_code = 1 THEN transaction_amount END),
       COUNT(CASE WHEN merchant_category_code = 2 THEN transaction_amount END),
       COUNT(CASE WHEN merchant_category_code = 3 THEN transaction_amount END),
       ...
       SUM(CASE WHEN merchant_category_code = 1 THEN transaction_amount END),
       SUM(CASE WHEN merchant_category_code = 2 THEN transaction_amount END),
       SUM(CASE WHEN merchant_category_code = 3 THEN transaction_amount END),
       ...
  FROM transaction 
 WHERE file_date BETWEEN 20121101 AND 20121130
 GROUP
    BY customer_token
;

(Or IF expressions, if you prefer; for documentation on both, see the section titled "Conditional Functions" on the page "LanguageManual+UDF" in the Hive wiki.)
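
For comparison, here is a minimal sketch of the same query written with IF instead of CASE. It assumes the same transaction table and integer file_date values as the query above. The column aliases (count_mcc_1, sum_mcc_1, and so on) are illustrative names that do not appear in the original, and only category codes 1 and 2 are shown for brevity.

```sql
-- Sketch only: the same conditional aggregation, using Hive's IF(condition, a, b).
-- The column aliases are hypothetical names chosen for illustration.
SELECT customer_token,
       COUNT(transaction_amount) AS txn_count,
       SUM(transaction_amount)   AS txn_sum,
       -- COUNT ignores NULL, so only rows in the given category are counted
       COUNT(IF(merchant_category_code = 1, transaction_amount, NULL)) AS count_mcc_1,
       COUNT(IF(merchant_category_code = 2, transaction_amount, NULL)) AS count_mcc_2,
       -- SUM also skips NULL; it yields NULL when a customer has no rows in that category
       SUM(IF(merchant_category_code = 1, transaction_amount, NULL))   AS sum_mcc_1,
       SUM(IF(merchant_category_code = 2, transaction_amount, NULL))   AS sum_mcc_2
  FROM transaction
 WHERE file_date BETWEEN 20121101 AND 20121130
 GROUP BY customer_token;
```

If a 0 is preferable to NULL for customers with no transactions in a category, passing 0 as the third argument of the IF inside SUM should give that result. The IF inside COUNT should keep NULL, because COUNT would count a 0 as a row.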