Pig nested foreach in Spark 2.0

Date: 2017-08-23 17:18:49

Tags: apache-spark apache-pig spark-dataframe

I am trying to convert a Pig script into a Spark 2 routine.

Inside a groupBy, I want to count the elements matching a particular state. The Pig code looks like this:

A = foreach (group payment by customer) {
    done = filter payment by state == 'done';
    doing = filter payment by state == 'doing';
    cancelled = filter payment by state == 'cancelled';
    generate group as customer, COUNT(done) as nb_done, COUNT(doing) as nb_doing, COUNT(cancelled) as nb_cancelled;
};

I would like to adapt this to a DataFrame, starting from payment.groupBy("customer").

Thanks!

1 answer:

Answer 0 (score: 0)

Try something like this:

Assuming the customer table is registered in the Spark session with the following schema:

 customer.registerTempTable("customer");
 sparkSession.sql("describe customer").show();

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|   string|   null|
|   state|   string|   null|
+--------+---------+-------+

-- Group using a map

sparkSession.sql("select id, count(state['done']) as done, " +
                "count(state['doing']) as doing, " +
                "count(state['cancelled']) as cancelled " +
                "from (select id, map(state, 1) as state from customer) t " +
                "group by id").show();
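The map trick works because count() ignores NULLs: map(state, 1)['done'] yields 1 only when state is 'done', and NULL otherwise. A minimal Python sketch (plain dicts, no Spark; sample data is hypothetical) of the same per-customer conditional counting:

```python
from collections import defaultdict

# Hypothetical payment rows: (customer, state).
payments = [
    ("c1", "done"), ("c1", "doing"), ("c1", "done"),
    ("c2", "cancelled"), ("c2", "done"),
]

STATES = ("done", "doing", "cancelled")

# Group by customer, then count each state within the group --
# the same aggregation the Pig foreach and the Spark SQL query compute.
counts = defaultdict(lambda: {s: 0 for s in STATES})
for customer, state in payments:
    if state in STATES:
        counts[customer][state] += 1

print(dict(counts))
# c1 -> done: 2, doing: 1, cancelled: 0; c2 -> done: 1, cancelled: 1
```

Each missing state stays at 0 for that customer, matching COUNT over an empty filtered bag in Pig.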