How to iterate over Spark cogroup values

Time: 2016-08-12 07:15:12

Tags: java apache-spark rdd

1) For Categories

twitter handle , categories , sub_categories

handle  , Products  , MakeUp
handle  , Health    , MakeUp
handle2 , Services  , Face
handle3 , Marketing , Soap

JavaPairRDD<String, Ontologies> categoryPairRDD

2) For Twitter

twitter handle , twitter_post , twitter_likes

handle  , "Iphone"  , 10
handle2 , "Samsung" , 20

JavaPairRDD<String, Twitter> twitterPairRDD


JavaPairRDD<String, Tuple2<Iterable<Ontologies>, Iterable<Twitter>>> grouped = categoryPairRDD
           .cogroup(twitterPairRDD);

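How should I iterate over the cogroup values so that, for each key, the matching values are printed when an object is found and null is printed otherwise?

For example, handle3 is present in my categoryPairRDD but absent from twitterPairRDD, so the output for key handle3 should be: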

handle3 , Marketing , Soap , null , null

The final output should be:

handle , Products , Makeup  , Iphone , 10
handle , Health , Makeup , Iphone , 10
handle2 , Services , Face , Samsung , 20
handle3  , Marketing, Soap ,  null , null

1 Answer:

Answer 0: (score: 1)

I managed to find a solution:

// Optional is org.apache.spark.api.java.Optional on Spark 2.x
// (com.google.common.base.Optional from Guava on Spark 1.x)
JavaPairRDD<String, Tuple2<Ontologies, Optional<Twitter>>> left = categoryPairRDD.leftOuterJoin(twitterPairRDD);

left.foreach(new VoidFunction<Tuple2<String, Tuple2<Ontologies, Optional<Twitter>>>>() {

    @Override
    public void call(Tuple2<String, Tuple2<Ontologies, Optional<Twitter>>> arg0) throws Exception {
        Optional<Twitter> tweet = arg0._2()._2();
        if (tweet.isPresent()) {
            // print values from the tuple, i.e. arg0._2()._1() and tweet.get()
        } else {
            // no tweet for this key: fall back to an empty tweet object
            Twitter empty = new Twitter("", -1);
            // print values from arg0._2()._1() and the empty tweet object
        }
    }
});

But I would still like to see an answer that uses cogroup.
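For a cogroup-based version, one possible approach is to treat each cogroup entry as a key plus two iterables and expand it the way a left outer join would: every left record is paired with every right record, or with a null placeholder when the right side is empty. The sketch below shows only that per-key expansion in plain Java so it can be run without a cluster. The class `CogroupRows`, the method `expand`, and the use of `String[]` as a stand-in for the record fields are all hypothetical names chosen for illustration, not part of Spark or of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: expands one cogroup entry into output rows.
class CogroupRows {

    // Each String[] stands in for one record's two printable fields,
    // e.g. {"Products", "MakeUp"} for a category or {"Iphone", "10"} for a tweet.
    static List<String> expand(String key,
                               Iterable<String[]> categories,
                               Iterable<String[]> tweets) {
        // Copy the right side so it can be checked for emptiness and iterated repeatedly.
        List<String[]> right = new ArrayList<>();
        for (String[] t : tweets) {
            right.add(t);
        }
        // No match on the right: pair every left record with a null placeholder.
        if (right.isEmpty()) {
            right.add(new String[] {"null", "null"});
        }

        List<String> rows = new ArrayList<>();
        for (String[] c : categories) {
            for (String[] t : right) {
                rows.add(key + " , " + c[0] + " , " + c[1] + " , " + t[0] + " , " + t[1]);
            }
        }
        return rows;
    }
}
```

In a Spark job, the same loops could be applied to each entry of `grouped`, reading the two iterables from `entry._2()._1()` and `entry._2()._2()`. Note that the `flatMap` callback returns an `Iterable` in Spark 1.x and an `Iterator` in Spark 2.x, so the wrapper around this helper depends on the version in use.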