Removing duplicates from rows in Hive SQL

Time: 2018-03-07 14:10:53

Tags: hive duplicates

I have this table:

+------------+---------------------+--+
|  country   |      commodity      |
+------------+---------------------+--+
| Argentina  | Copper, molybdenum  |
| Argentina  | Silver, lead        |
| Argentina  | Copper, gold        |
| Argentina  | Copper, gold        |
| Argentina  | Copper              |
| Spain      | Rhodochrosite       |
| Spain      | Copper              |
| Spain      | Limestone           |
| Spain      | Gold                |
| Spain      | Limestone           |
+------------+---------------------+--+

I want to display this:

+------------+-----------------------------------------+--+
|  country   |                   minerals              |
+------------+-----------------------------------------+--+
| Argentina  | copper, molybdenum, silver, lead, gold  |
| Spain      | rhodochrosite, copper, limestone, gold  |
+------------+-----------------------------------------+--+

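So I want to combine all the commodities for each country into a single column, "minerals", and eliminate the duplicates. However, as you can see in the first table, the original "commodity" column can hold more than one mineral per row, and the same mineral can appear in different case (gold, Gold, etc.).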

I tried:

SELECT country, CONCAT_WS(', ' ,COLLECT_SET(LOWER(commodity))) as minerals 
FROM depositOPT 
GROUP BY country;

But it does not eliminate the duplicates, because the output looks like this:

    +------------+------------------------------------------------------------------------
    |  country   |                   minerals                  
    +------------+------------------------------------------------------------------------
    | Argentina  | copper, molybdenum, silver, lead, copper, gold, copper, gold, copper  
    | Spain      | rhodochrosite, copper, limestone, gold, limestone    
    +------------+------------------------------------------------------------------------

Thanks for any suggestions.

1 answer:

Answer 0 (score: 1)

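I would split the commodity column into individual minerals, so that duplicates can be removed one mineral at a time rather than one row at a time. Hope this helps. Thanks.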
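
A minimal sketch of that approach, assuming the table and column names from the question (depositOPT, country, commodity) and that a comma is the only separator between minerals. COLLECT_SET compares whole strings, so 'copper, gold' and 'copper' count as different values; splitting each row with SPLIT and LATERAL VIEW EXPLODE first gives it one mineral per row to deduplicate. The aliases t, s, m and mineral are made up for illustration.

```sql
-- Sketch only: table and column names are taken from the question;
-- the aliases t, s, m and mineral are hypothetical.
SELECT country,
       CONCAT_WS(', ', COLLECT_SET(mineral)) AS minerals
FROM (
    SELECT country,
           -- normalise case and strip the space left after each comma
           LOWER(TRIM(m)) AS mineral
    FROM depositOPT
    -- one output row per comma-separated mineral in commodity
    LATERAL VIEW EXPLODE(SPLIT(commodity, ',')) t AS m
) s
GROUP BY country;
```

As far as I know, Hive does not document any ordering for the elements returned by COLLECT_SET, so the minerals may come out in a different order from the one in the desired output.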
