Using combineByKey instead of reduceByKey in Java

Time: 2017-08-28 15:38:48

Tags: apache-spark

I have the following dataset:

    key   value
---------------------------
    key1,CLASS-A,YES,1
    key2,CLASS-B,YES,2
    key2,CLASS-B,YES,1
    key1,CLASS-A,YES,4
    key3,CLASS-C,DEFAULT,1

The output should look like this:

    key   value
---------------------------
key1,    CLASS-A,YES,5
key2,    CLASS-B,YES,3
key3,    CLASS-C,NO,1

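When I use reduceByKey to get the result, I find that whenever a key has only one value (key3 in this case), the reduce function is never called for that key, because there is nothing to reduce. I get: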

   key   value
---------------------------
key1,    CLASS-A,YES,5
key2,    CLASS-B,YES,3
key3,    CLASS-C,DEFAULT,1
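
The behaviour described above can be reproduced without a cluster. The following is only a plain-Java sketch: `Map.merge` stands in for reduceByKey (it is not Spark code), and the class and method names are made up for illustration. It assumes the values are shaped like `CLASS-A,YES,1`. The reducer only runs when a key already holds a value, so key3 keeps its raw value:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

final class ReduceSkipsSingletons {

    // Plain-Java stand-in for reduceByKey (assumption: not Spark's implementation).
    // Map.merge stores the first value as-is and calls the reducer only from the second value on.
    static Map<String, String> reduceByKey(List<String[]> pairs, BinaryOperator<String> reducer) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            result.merge(pair[0], pair[1], reducer);
        }
        return result;
    }

    // Reducer: status is YES if either side is YES, otherwise NO; the counts are summed.
    static String reduce(String a, String b) {
        String[] x = a.split(",");
        String[] y = b.split(",");
        String status = (x[1].equals("YES") || y[1].equals("YES")) ? "YES" : "NO";
        int count = Integer.parseInt(x[2]) + Integer.parseInt(y[2]);
        return x[0] + "," + status + "," + count;
    }

    static Map<String, String> demo() {
        List<String[]> pairs = Arrays.asList(
                new String[] {"key1", "CLASS-A,YES,1"},
                new String[] {"key2", "CLASS-B,YES,2"},
                new String[] {"key2", "CLASS-B,YES,1"},
                new String[] {"key1", "CLASS-A,YES,4"},
                new String[] {"key3", "CLASS-C,DEFAULT,1"});
        return reduceByKey(pairs, ReduceSkipsSingletons::reduce);
    }

    public static void main(String[] args) {
        // key3 is printed with DEFAULT untouched, because reduce() never saw it.
        demo().forEach((k, v) -> System.out.println(k + ", " + v));
    }
}
```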

Can I achieve this using combineByKey in Spark (Java)?

What I have tried so far:

reduceByKey(
                    new Function2<String, String, String>() {

                        @Override
                        public String call(String s1, String s2) throws Exception {
                            String[] vals1 = StringUtils.split(s1, ",");
                            String[] vals2 = StringUtils.split(s2, ",");

                            String jobStat1 = vals1[0];
                            String jobStat2 = vals2[0];

                            String reducedJobStat;

                            // True when the corresponding value's status is YES.
                            boolean isYes1 = jobStat1.equals("YES");
                            boolean isYes2 = jobStat2.equals("YES");

                            if (isYes1 || isYes2) {
                                reducedJobStat = "YES";
                            } else {
                                reducedJobStat = "NO";
                            }

                            return reducedJobStat;
                        }
                    }
            )

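2 answers:

Answer 0 (score: 1)

The fundamental difference between reduceByKey and combineByKey in Spark is that reduceByKey takes a single function that accepts a pair of values and returns one value of the same type, whereas combineByKey lets you transform the data while aggregating and takes three functions. The first creates the new value type (the combiner) from an existing value, the second merges an existing value into a combiner, and the third merges two combiners.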
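
To make the roles of the three functions concrete, here is a minimal plain-Java sketch that imitates combineByKey over two in-memory "partitions". It is an illustration under assumptions, not Spark's implementation: the helper `combineByKey`, the class name, and the sample numbers are all hypothetical. The combiner type (`int[]` holding sum and count) differs from the value type (`Integer`), which is the thing reduceByKey cannot do:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class CombineByKeySketch {

    // Hypothetical in-memory imitation of combineByKey; each inner list plays the role of a partition.
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<K, C> merged = new LinkedHashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> local = new LinkedHashMap<>();
            for (Map.Entry<K, V> e : partition) {
                C existing = local.get(e.getKey());
                // First value of a key in this partition: createCombiner; later values: mergeValue.
                local.put(e.getKey(), existing == null
                        ? createCombiner.apply(e.getValue())
                        : mergeValue.apply(existing, e.getValue()));
            }
            // Combiners built on different partitions are joined with mergeCombiners.
            local.forEach((k, c) -> merged.merge(k, c, mergeCombiners));
        }
        return merged;
    }

    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Per-key average: values are Integer, combiners are {sum, count}.
    static Map<String, Double> averages() {
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
                Arrays.asList(entry("a", 1), entry("b", 2), entry("b", 4)),
                Arrays.asList(entry("a", 5), entry("c", 7)));
        Map<String, int[]> sumCount = CombineByKeySketch.<String, Integer, int[]>combineByKey(
                partitions,
                v -> new int[] {v, 1},
                (c, v) -> new int[] {c[0] + v, c[1] + 1},
                (c1, c2) -> new int[] {c1[0] + c2[0], c1[1] + c2[1]});
        Map<String, Double> result = new LinkedHashMap<>();
        sumCount.forEach((k, c) -> result.put(k, (double) c[0] / c[1]));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(averages());
    }
}
```

Note that key "c" has a single value and still passes through createCombiner, so any transformation placed there is applied to every key.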

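The best example of combineByKey I have seen is at http://codingjunkie.net/spark-combine-by-key/

For your specific case, I would suggest keeping it simple: use reduceByKey, and then use mapValues to apply the required transformation to key3. That could look something like: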

reduced_rdd.mapValues(v => (v._1, if (v._2 == "DEFAULT") "NO" else v._2, v._3))
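
Since the question is in Java, here is a rough Java counterpart of the function one might pass to mapValues. This is only a sketch: it assumes the reduced values are strings shaped like `CLASS-C,DEFAULT,1` rather than tuples, and it uses a plain `Map` in place of the reduced RDD, so the class and method names are illustrative and nothing here is Spark API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NormalizeStatus {

    // Rewrites a DEFAULT status to NO and leaves the other fields alone.
    // Assumes the value is "<class>,<status>,<count>".
    static String normalize(String value) {
        String[] fields = value.split(",");
        String status = fields[1].equals("DEFAULT") ? "NO" : fields[1];
        return fields[0] + "," + status + "," + fields[2];
    }

    static Map<String, String> demo() {
        // The values as they come out of the reduce step in the question.
        Map<String, String> reduced = new LinkedHashMap<>();
        reduced.put("key1", "CLASS-A,YES,5");
        reduced.put("key2", "CLASS-B,YES,3");
        reduced.put("key3", "CLASS-C,DEFAULT,1");
        // Stand-in for mapValues: apply normalize to every value, keys unchanged.
        reduced.replaceAll((key, value) -> normalize(value));
        return reduced;
    }

    public static void main(String[] args) {
        demo().forEach((k, v) -> System.out.println(k + ", " + v));
    }
}
```

Because the normalisation runs on every value after the reduce step, it does not matter whether the reduce function was ever called for a given key.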

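Answer 1 (score: 0)

So, I arrived at an alternative solution using combineByKey. The reduceByKey code looks simpler, but I have to run a mapValues after reduceByKey (see the question for the reason) to get the result.

combineByKey is fairly simple once you understand how it works internally.

Example using combineByKey

Input: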

key   value

key1,CLASS-A,YES,1
key2,CLASS-B,YES,2
key2,CLASS-B,YES,1
key1,CLASS-A,YES,4
key3,CLASS-C,DEFAULT,1



 // CreateCombiner initialises a key the first time it is seen within a partition. Here, let's divide the input into 2 partitions:

 Partition 1:                                            Partition 2:

     key        value                                key   value
 ---------------------------                 ---------------------------
     key1,  CLASS-A,YES,1                       key1,           CLASS-A,YES,4
     key2,  CLASS-B,YES,2                       key3,           CLASS-C,DEFAULT,1
     key2,  CLASS-B,YES,1                        

import org.apache.spark.api.java.function.Function;

public class CreateCombiner implements Function<String, String> {

    // Called for the first value of each key within a partition.
    // The value is assumed to be space-separated: "<jobStatus> <outCount> <lbdCount>".
    @Override
    public String call(String value) {
        String[] vals = value.split(" ");
        String jobStatus = vals[0];

        if (jobStatus.equals("YES")
                || jobStatus.equals("DEFAULT")) {
            return "YES" + " " + vals[1] + " " + vals[2];
        } else {
            return "NO" + " " + vals[1] + " " + vals[2];
        }
    }
}
 When key1 is first encountered in the 1st partition, CreateCombiner initialises the value for that key. In our case it rewrites the status field (the second string, YES/NO/DEFAULT), because in my use case I want to change every "DEFAULT" to "YES": YES and DEFAULT both become YES, and anything else becomes NO. The same happens when key2 is first seen in the 1st partition.

 When key2 is encountered again in the 1st partition, MergeValue is called and merges the new value into the existing one. Here key2 has 2 values (CLASS-B,YES,2 and CLASS-B,YES,1), so they are merged into (key2,CLASS-B,YES,3).

 MergeCombiner takes the combiners created on each partition and merges them together. In my case the logic is the same as in MergeValue.
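
The walk-through above can be replayed in plain Java to see which function fires for which record. This is a hedged sketch, not Spark code: the class name, the `calls` log, and the sequential merging of partitions are all assumptions made for illustration, and the values use the `CLASS-A,YES,1` shape from the tables rather than the answer's own field layout. It follows the rule stated above that DEFAULT becomes YES:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CombineTrace {

    // Records which of the three functions was applied to which key, in order.
    static final List<String> calls = new ArrayList<>();

    // First value of a key in a partition: normalise the status (YES or DEFAULT -> YES, else NO).
    static String createCombiner(String value) {
        String[] f = value.split(",");
        boolean yes = f[1].equals("YES") || f[1].equals("DEFAULT");
        return f[0] + "," + (yes ? "YES" : "NO") + "," + f[2];
    }

    // Later value of a key in the same partition: normalise it, then fold it into the combiner.
    static String mergeValue(String combiner, String value) {
        return mergeCombiners(combiner, createCombiner(value));
    }

    // Two combiners for the same key: YES if either is YES, counts summed.
    static String mergeCombiners(String c1, String c2) {
        String[] x = c1.split(",");
        String[] y = c2.split(",");
        String status = (x[1].equals("YES") || y[1].equals("YES")) ? "YES" : "NO";
        return x[0] + "," + status + "," + (Integer.parseInt(x[2]) + Integer.parseInt(y[2]));
    }

    static Map<String, String> run() {
        calls.clear();
        List<List<String[]>> partitions = Arrays.asList(
                Arrays.asList(
                        new String[] {"key1", "CLASS-A,YES,1"},
                        new String[] {"key2", "CLASS-B,YES,2"},
                        new String[] {"key2", "CLASS-B,YES,1"}),
                Arrays.asList(
                        new String[] {"key1", "CLASS-A,YES,4"},
                        new String[] {"key3", "CLASS-C,DEFAULT,1"}));
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<String[]> partition : partitions) {
            Map<String, String> local = new LinkedHashMap<>();
            for (String[] kv : partition) {
                if (local.containsKey(kv[0])) {
                    calls.add("mergeValue(" + kv[0] + ")");
                    local.put(kv[0], mergeValue(local.get(kv[0]), kv[1]));
                } else {
                    calls.add("createCombiner(" + kv[0] + ")");
                    local.put(kv[0], createCombiner(kv[1]));
                }
            }
            for (Map.Entry<String, String> e : local.entrySet()) {
                if (merged.containsKey(e.getKey())) {
                    calls.add("mergeCombiners(" + e.getKey() + ")");
                    merged.put(e.getKey(), mergeCombiners(merged.get(e.getKey()), e.getValue()));
                } else {
                    merged.put(e.getKey(), e.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(run());
        System.out.println(calls);
    }
}
```

In this replay key3 only ever goes through createCombiner, which is enough to rewrite its DEFAULT status even though nothing is merged for it.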


import org.apache.commons.lang3.StringUtils; // assuming Apache Commons Lang 3
import org.apache.spark.api.java.function.Function2;

public class MergeValue implements Function2<String, String, String> {

    // MergeValue decides the jobStatus and adds up outCount and lbdCount.
    // It merges a new value (v2) into the previously collected value(s) (v1) for the same key.
    // Both are assumed to be space-separated: "<jobStatus> <outCount> <lbdCount>".

    @Override
    public String call(String v1, String v2) throws Exception {
        String[] vals1 = StringUtils.split(v1, " ");
        String[] vals2 = StringUtils.split(v2, " ");

        String jobStat1 = vals1[0];
        String jobStat2 = vals2[0];

        String reducedJobStat;

        boolean stat1Process = jobStat1.equals("YES")
                || jobStat1.equals("DEFAULT");
        boolean stat2Process = jobStat2.equals("YES")
                || jobStat2.equals("DEFAULT");

        if (stat1Process || stat2Process) {
            reducedJobStat = "YES";
        } else {
            reducedJobStat = "NO";
        }

        int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
        int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);

        return reducedJobStat + " " + outCount + " " + lbdCount;
    }
}



import org.apache.commons.lang3.StringUtils; // assuming Apache Commons Lang 3
import org.apache.spark.api.java.function.Function2;

public class MergeCombiner implements Function2<String, String, String> {

    // This function combines the merged values produced by MergeValue.
    // It takes the combiners produced at the partition level and combines them
    // until we end up with one single value per key.
    // Combiners are assumed to be space-separated: "<jobStatus> <outCount> <lbdCount>".
    @Override
    public String call(String v1, String v2) throws Exception {
        String[] vals1 = StringUtils.split(v1, " ");
        String[] vals2 = StringUtils.split(v2, " ");

        String jobStat1 = vals1[0];
        String jobStat2 = vals2[0];

        String reducedJobStat;

        // Decide the jobStatus from the two combiners: if either of them is YES,
        // the merged status is YES; otherwise it is NO.
        boolean stat1Process = jobStat1.equals("YES");
        boolean stat2Process = jobStat2.equals("YES");

        if (stat1Process || stat2Process) {
            reducedJobStat = "YES";
        } else {
            reducedJobStat = "NO";
        }

        int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
        int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);

        return reducedJobStat + " " + outCount + " " + lbdCount;
    }
}

Calling combineByKey:

combineByKey(new CreateCombiner(), new MergeValue(), new MergeCombiner());

The same logic implemented with reduceByKey:

 reduceByKey(
         new Function2<String, String, String>() {

             @Override
             public String call(String s1, String s2) throws Exception {
                 String[] vals1 = StringUtils.split(s1, " ");
                 String[] vals2 = StringUtils.split(s2, " ");

                 String jobStat1 = vals1[0];
                 String jobStat2 = vals2[0];

                 String reducedJobStat;

                 boolean stat1Process = jobStat1.equals("YES")
                         || jobStat1.equals("DEFAULT");
                 boolean stat2Process = jobStat2.equals("YES")
                         || jobStat2.equals("DEFAULT");

                 if (stat1Process || stat2Process) {
                     reducedJobStat = "YES";
                 } else {
                     reducedJobStat = "NO";
                 }

                 int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
                 int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);

                 return reducedJobStat + " " + outCount + " " + lbdCount;
             }
         }).mapValues(new Function<String, String>() {

             // Runs on every value, including keys that reduceByKey never had to reduce.
             @Override
             public String call(String s) throws Exception {
                 String[] vals = s.split(" ");
                 String jobStatus = vals[0];

                 if (jobStatus.equals("YES")
                         || jobStatus.equals("DEFAULT")) {
                     return "YES" + " " + vals[1] + " " + vals[2];
                 } else {
                     return "NO" + " " + vals[1] + " " + vals[2];
                 }
             }
         });