How to create a map of maps in Spark

Time: 2018-01-25 09:07:45

Tags: apache-spark spark-dataframe spark-streaming

I have a CSV file like the following:

T1,Data1,1278
T1,Data1,1279
T1,Data1,1280
T1,Data2,1283
T1,Data2,1284
T2,Data1,1278
T2,Data1,1290

I want to build a JavaPairRDD as a map of maps, like this:

T1,[(Data1, (1278,1279,1280)), (Data2, (1283,1284))]
T2,[(Data1, (1278,1290))]
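For clarity, the target shape is a two-level grouping: timestamp → data label → list of integers. As a plain-Java sketch of that nesting (illustrative only, not Spark code; the class and method names here are invented for this example), `Collectors.groupingBy` can produce the same structure from the rows above:

```java
import java.util.*;
import java.util.stream.Collectors;

public class NestedGrouping {
    // Build key -> (subkey -> list of values) from rows of (key, subkey, value).
    public static Map<String, Map<String, List<Integer>>> group(List<String[]> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r[0],                                   // outer key: timestamp column
                Collectors.groupingBy(
                        r -> r[1],                           // inner key: data label column
                        Collectors.mapping(                  // collect the values as a list
                                r -> Integer.valueOf(r[2]),
                                Collectors.toList()))));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"T1", "Data1", "1278"},
                new String[]{"T1", "Data1", "1279"},
                new String[]{"T1", "Data1", "1280"},
                new String[]{"T1", "Data2", "1283"},
                new String[]{"T1", "Data2", "1284"},
                new String[]{"T2", "Data1", "1278"},
                new String[]{"T2", "Data1", "1290"});
        System.out.println(NestedGrouping.group(rows));
    }
}
```

The inner lists preserve input order; key order in the printed output is not guaranteed because `groupingBy` defaults to `HashMap`.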

I tried the following code using combineByKey.

Creating the JavaPairRDD:

JavaPairRDD<Timestamp, List<Tuple2<String, List<Integer>>>> itemRDD =
    myrdd.mapToPair(new PairFunction<Row, Timestamp, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<Timestamp, Tuple2<String, Integer>> call(Row row) throws Exception {
            return new Tuple2<>(row.getTimestamp(0),
                    new Tuple2<>(row.getString(1), row.getInt(2)));
        }
    }).combineByKey(createAcc, addItem, combine);

But I am unable to create a PairRDD like the one above. Is my approach correct? Can combineByKey be used to build a map of maps in Spark?

1 answer:

Answer 0 (score: 1)

Try using the cogroup method instead of combineByKey.
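For context on what this suggests: cogroup lines up, for each key, all values from two pair RDDs side by side. Below is a plain-Java emulation of that result shape (the class `CogroupSketch` and its method are invented for illustration; this is not the Spark API). Applied to the question's data, one map per data label, keyed by timestamp, would cogroup into the nested structure the asker wants:

```java
import java.util.*;

public class CogroupSketch {
    // Emulates the shape of a two-way cogroup: for every key present in either
    // input, pair the list of values from the first input with the list of
    // values from the second (empty list where a key is missing on one side).
    public static <K, V, W> Map<K, Map.Entry<List<V>, List<W>>> cogroup(
            Map<K, List<V>> left, Map<K, List<W>> right) {
        Map<K, Map.Entry<List<V>, List<W>>> out = new HashMap<>();
        Set<K> keys = new HashSet<>(left.keySet());
        keys.addAll(right.keySet());
        for (K k : keys) {
            out.put(k, new AbstractMap.SimpleEntry<>(
                    left.getOrDefault(k, Collections.emptyList()),
                    right.getOrDefault(k, Collections.emptyList())));
        }
        return out;
    }

    public static void main(String[] args) {
        // Data1 values per timestamp, and Data2 values per timestamp,
        // taken from the CSV in the question.
        Map<String, List<Integer>> data1 = new HashMap<>();
        data1.put("T1", Arrays.asList(1278, 1279, 1280));
        data1.put("T2", Arrays.asList(1278, 1290));
        Map<String, List<Integer>> data2 = new HashMap<>();
        data2.put("T1", Arrays.asList(1283, 1284));
        System.out.println(cogroup(data1, data2));
    }
}
```

Note that T2 ends up with an empty Data2 list rather than being dropped, which is the main behavioral difference from a join.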