Question

I have Parquet files on HDFS containing records in the following format:

unique_id_1 | state_1| prop1| prop2 | prop3
unique_id_1 | state_2| prop4| prop5 | prop6
unique_id_1 | state_3| prop7| prop8 | prop9
unique_id_2 | state_1| prop1| prop2 | prop3
unique_id_2 | state_2| prop4| prop5 | prop6
unique_id_2 | state_3| prop7| prop8 | prop9

Each unique_id_x appears in exactly 3 records, one for each of 3 distinct states, and each record carries different other properties.

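What I need to do is group all records by unique_id and merge the properties within each group, creating one brand-new record that will be written to an output Parquet file.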

I have looked into Spark's RDD groupBy followed by mapValues. This gives me an Iterable of 3 rows with the 3 different states, from which I can build my new row just fine.

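However, I have seen a lot of advice against the rdd.groupBy().mapValues() approach for performance reasons: it needs to shuffle a lot of data (the 3 records need to end up on the same reducer), and it does not use a partition-local combiner to reduce the data.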
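To make the combiner concern concrete, here is a minimal plain-Java simulation (no Spark involved). It only counts how many records would have to cross the network, under the simplifying assumption that a combiner collapses all records sharing a key inside one partition into a single partial record. The class name `CombinerSketch` and the two partition layouts are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation (no Spark): counts the records that would be shuffled
// for a group-by, with and without a partition-local combiner.
class CombinerSketch {

    // Each inner list is one input partition; each element is the key of one record.
    // Without a combiner, every record is shuffled as-is.
    static int shuffledWithoutCombiner(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    // With a combiner, records sharing a key inside one partition are merged
    // first, so each partition ships at most one partial record per key.
    static int shuffledWithCombiner(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            Set<String> distinctKeys = new HashSet<>(partition);
            total += distinctKeys.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical layout: the 3 states of each id sit in 3 different partitions.
        List<List<String>> scattered = Arrays.asList(
                Arrays.asList("unique_id_1", "unique_id_2"),
                Arrays.asList("unique_id_1", "unique_id_2"),
                Arrays.asList("unique_id_1", "unique_id_2"));
        // Hypothetical layout: all 3 states of an id happen to share a partition.
        List<List<String>> colocated = Arrays.asList(
                Arrays.asList("unique_id_1", "unique_id_1", "unique_id_1"),
                Arrays.asList("unique_id_2", "unique_id_2", "unique_id_2"));

        System.out.println(shuffledWithoutCombiner(scattered)); // 6
        System.out.println(shuffledWithCombiner(scattered));    // 6
        System.out.println(shuffledWithoutCombiner(colocated)); // 6
        System.out.println(shuffledWithCombiner(colocated));    // 2
    }
}
```

Under these assumptions a combiner only reduces the record count when records with the same key already share a partition. With exactly 3 records per key the possible saving is bounded either way, and a merged partial record is also larger than a single input record, so the count alone overstates the benefit.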

The other suggestions listed here recommend using aggregate or groupByKey, but those seem to work only if you have a simple sum() or count() aggregation function, not custom logic like I need.
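For reference, an aggregate-style operation is not limited to sums and counts as long as the custom logic can be split into two steps: folding one record into an accumulator, and merging two accumulators. This is the same shape as the zero value, sequence function, and combine function that Spark's aggregateByKey takes. The sketch below is plain Java without Spark; `PartialRecord`, `add`, and `combine` are hypothetical stand-ins for MyBean and merge(...), whose real logic is not shown in this question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java sketch (no Spark). PartialRecord and its merge rule are
// hypothetical; they only show the shape of a custom aggregation.
class CustomAggregateSketch {

    // Accumulator: the properties seen so far, keyed by state name.
    static final class PartialRecord {
        final TreeMap<String, List<String>> propsByState = new TreeMap<>();
    }

    // Sequence step: fold one input row (a state and its properties) into an accumulator.
    static PartialRecord add(PartialRecord acc, String state, List<String> props) {
        acc.propsByState.put(state, props);
        return acc;
    }

    // Combine step: merge two accumulators that were built on different partitions.
    static PartialRecord combine(PartialRecord left, PartialRecord right) {
        PartialRecord result = new PartialRecord();
        result.propsByState.putAll(left.propsByState);
        result.propsByState.putAll(right.propsByState);
        return result;
    }

    public static void main(String[] args) {
        // One partition saw state_1 and state_3 of an id, another saw state_2.
        PartialRecord a = add(new PartialRecord(), "state_1", Arrays.asList("prop1", "prop2", "prop3"));
        a = add(a, "state_3", Arrays.asList("prop7", "prop8", "prop9"));
        PartialRecord b = add(new PartialRecord(), "state_2", Arrays.asList("prop4", "prop5", "prop6"));

        // The result is the same regardless of which side is merged first.
        System.out.println(combine(a, b).propsByState);
        System.out.println(combine(b, a).propsByState);
    }
}
```

Because each state occurs once per key, the combine step here does not depend on the order in which partial records arrive. Whether the real merge(...) logic can be decomposed this way depends on details not shown here.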

Is there a better way than my current group-by to reach the end result?

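Edit: I have seen this answer about grouping: "groupByKey is fine for the case when we want a "smallish" collection of values per key, as in the question." Given my situation, where I always have exactly 3 values per key, is this approach the best fit for the case?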

Edit 2: Code attached:

SQLContext sqlContext = SQLContext.getOrCreate(sparkContext.sc());

JavaRDD<MyBean> rdd = sqlContext
        .read()
        .parquet(inputLocation)
        .toJavaRDD()
        .groupBy((Function<Row, String>) v1 -> v1.getAs("record_id"))
        .mapValues((Function<Iterable<Row>, MyBean>) records -> {
            // get the 3 states from the iterator and merge them
            Supplier<Stream<Row>> rowSupplier = () -> StreamSupport.stream(records.spliterator(), false);
            Optional<Row> state1Row = rowSupplier.get().filter(row -> row.getAs("state").equals("state_1")).findFirst();
            Optional<Row> state2Row = rowSupplier.get().filter(row -> row.getAs("state").equals("state_2")).findFirst();
            Optional<Row> state3Row = rowSupplier.get().filter(row -> row.getAs("state").equals("state_3")).findFirst();

            return merge(state1Row, state2Row, state3Row); // contains the merge logic; returns an instance of MyBean
        })
        .map((Function<Tuple2<String, MyBean>, MyBean>) Tuple2::_2); // not interested in the record_id anymore

sqlContext.createDataFrame(rdd, MyBean.class).write().parquet(outputLocation);

Spark group by with custom logic performance

0 answers: