I have the following dataset:
key value
---------------------------
key1,CLASS-A,YES,1
key2,CLASS-B,YES,2
key2,CLASS-B,YES,1
key1,CLASS-A,YES,4
key3,CLASS-C,DEFAULT,1
The output should look like this:
key value
---------------------------
key1, CLASS-A,YES,5
key2, CLASS-B,YES,3
key3, CLASS-C,NO,1
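When I use reduceByKey to get the result, I find that whenever a key has only a single value (key3 in this case), the reduce function is never called for it, because there is nothing to reduce. I get: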
key value
---------------------------
key1, CLASS-A,YES,5
key2, CLASS-B,YES,3
key3, CLASS-C,DEFAULT,1
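The behaviour described above is not specific to Spark: a pairwise reduce has nothing to combine when a key holds a single value. As a rough plain-Java analogy (no Spark involved; the class name `SingleValueReduce` and its `reduceByKey` helper are made up for this illustration), `Map.merge` likewise invokes its function only when the key already has a value:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class SingleValueReduce {
    // Hypothetical local stand-in for reduceByKey over (key, value) pairs.
    static Map<String, String> reduceByKey(List<String[]> pairs, BinaryOperator<String> fn) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            // Map.merge stores the first value as-is; fn runs only from the second value on.
            out.merge(kv[0], kv[1], fn);
        }
        return out;
    }

    // "YES" if either side is "YES", otherwise "NO"; the counts are added.
    static String reduce(String a, String b) {
        String[] x = a.split(",");
        String[] y = b.split(",");
        String status = (x[1].equals("YES") || y[1].equals("YES")) ? "YES" : "NO";
        int count = Integer.parseInt(x[2]) + Integer.parseInt(y[2]);
        return x[0] + "," + status + "," + count;
    }

    // Runs the sample rows from the question through the local reduce.
    static Map<String, String> sample() {
        List<String[]> pairs = Arrays.asList(
            new String[] {"key1", "CLASS-A,YES,1"},
            new String[] {"key2", "CLASS-B,YES,2"},
            new String[] {"key2", "CLASS-B,YES,1"},
            new String[] {"key1", "CLASS-A,YES,4"},
            new String[] {"key3", "CLASS-C,DEFAULT,1"});
        return reduceByKey(pairs, SingleValueReduce::reduce);
    }

    public static void main(String[] args) {
        sample().forEach((k, v) -> System.out.println(k + ", " + v));
    }
}
```

In this sketch key3 keeps its DEFAULT status because `reduce` never sees it, which mirrors the result shown above.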
How can I achieve this using combineByKey in Spark (Java)?
What I have tried so far:
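reduceByKey(
    new Function2<String, String, String>() {
        @Override
        public String call(String s1, String s2) throws Exception {
            // Each value looks like "CLASS-A,YES,1": class, job status, count.
            // StringUtils is assumed to be the Apache Commons Lang class.
            String[] vals1 = StringUtils.split(s1, ",");
            String[] vals2 = StringUtils.split(s2, ",");
            String jobStat1 = vals1[1];
            String jobStat2 = vals2[1];
            // Renamed from s1/s2, which clashed with the method parameters.
            boolean isYes1 = jobStat1.equals("YES");
            boolean isYes2 = jobStat2.equals("YES");
            String reducedJobStat = (isYes1 || isYes2) ? "YES" : "NO";
            // Keep the class and add the counts so the result has the same shape as the input.
            int count = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);
            return vals1[0] + "," + reducedJobStat + "," + count;
        }
    }
)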
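Answer 0: (score: 1)
The fundamental difference between reduceByKey and combineByKey in Spark is that reduceByKey requires a function that takes a pair of values and returns a single value of the same type, whereas combineByKey lets you transform the data at the same time and requires three functions. The first creates a value of the new type from a value of the existing type, the second merges a value of the existing type into a value of the new type, and the third merges two values of the new type.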
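To make the three roles concrete, here is a minimal local sketch in plain Java. It does not use Spark: `CombineByKeySketch` and its `combineByKey` helper are hypothetical names, and the list of partitions only imitates how Spark splits data. It turns Integer values into a {sum, count} array, a combined type that differs from the value type:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class CombineByKeySketch {
    // Local imitation of combineByKey: V is the input value type, C the combined type.
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        Map<K, C> result = new HashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> local = new HashMap<>();
            for (Map.Entry<K, V> e : partition) {
                C existing = local.get(e.getKey());
                // First value for this key in this partition: create; otherwise merge it in.
                local.put(e.getKey(), existing == null
                        ? createCombiner.apply(e.getValue())
                        : mergeValue.apply(existing, e.getValue()));
            }
            // Combine the per-partition results across partitions.
            local.forEach((k, c) -> result.merge(k, c, mergeCombiners));
        }
        return result;
    }

    static Map.Entry<String, Integer> entry(String k, Integer v) {
        return new AbstractMap.SimpleEntry<>(k, v);
    }

    // Made-up data: two partitions of (key, number) pairs reduced to {sum, count} per key.
    static Map<String, int[]> sumAndCount() {
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
            Arrays.asList(entry("a", 1), entry("b", 2), entry("b", 1)),
            Arrays.asList(entry("a", 4), entry("c", 1)));
        return combineByKey(partitions,
            v -> new int[] {v, 1},                                   // createCombiner: V -> C
            (c, v) -> new int[] {c[0] + v, c[1] + 1},                // mergeValue: (C, V) -> C
            (c1, c2) -> new int[] {c1[0] + c2[0], c1[1] + c2[1]});   // mergeCombiners: (C, C) -> C
    }

    public static void main(String[] args) {
        sumAndCount().forEach((k, c) -> System.out.println(k + " -> " + Arrays.toString(c)));
    }
}
```

Note that key "c" has a single value, yet it still passes through the first function, which is the property the question is after.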
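The best example of combineByKey I have seen is at http://codingjunkie.net/spark-combine-by-key/
For your specific case, I would suggest keeping it simple: use reduceByKey, then use mapValues to apply the required transformation to key3. It might look something like this: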
reduced_rdd.mapValues(v => (v._1, if (v._2 == "DEFAULT") "NO" else v._2, v._3))
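Answer 1: (score: 0)
So, I arrived at an alternative solution using combineByKey. The reduceByKey code looks simpler, but I have to run a mapValues after reduceByKey (see the question for the reason) to get the result.
combineByKey is fairly simple once we understand how it works internally.
Input: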
key value
key1,CLASS-A,YES,1
key2,CLASS-B,YES,2
key2,CLASS-B,YES,1
key1,CLASS-A,YES,4
key3,CLASS-C,DEFAULT,1
The CreateCombiner initialises the value for a key the first time that key is seen in a partition. Here, let's divide the input into 2 partitions:
Partition 1:                    Partition 2:
key value                       key value
---------------------------     ---------------------------
key1, CLASS-A,YES,1             key1, CLASS-A,YES,4
key2, CLASS-B,YES,2             key3, CLASS-C,DEFAULT,1
key2, CLASS-B,YES,1
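// Requires org.apache.spark.api.java.function.Function.
public class CreateCombiner implements Function<String, String> {
    // NOTE: the code in this answer assumes space-separated values of the form
    // "<status> <outCount> <lbdCount>" (this is what the split(" ") calls and the
    // return values imply), not the comma-separated sample rows shown above.
    @Override
    public String call(String value) {
        String[] vals = value.split(" ");
        String jobStatus = vals[0];
        // Normalise the status: YES and DEFAULT become YES, anything else becomes NO.
        if (jobStatus.equals("YES") || jobStatus.equals("DEFAULT")) {
            return "YES" + " " + vals[1] + " " + vals[2];
        } else {
            return "NO" + " " + vals[1] + " " + vals[2];
        }
    }
}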
When key1 is first encountered in the 1st partition, the CreateCombiner initialises that key's value. In our case it changes the status field of the value (the 2nd string: YES/NO/DEFAULT),
because in my use case I want to change all "DEFAULT" to "YES".
It replaces YES and DEFAULT with YES, and anything else with NO. The same happens for key2 the first time it is seen in the 1st partition.
When key2 is found again in the 1st partition, the MergeValue class is called. It merges the values. Here key2 has 2 values (CLASS-B,YES,2 and CLASS-B,YES,1), and it merges both
into (key2,CLASS-B,YES,3).
The MergeCombiner takes the combiners (tuples) created on each partition and merges them together. So in my case the logic is the same as in MergeValue.
// Requires org.apache.spark.api.java.function.Function2; StringUtils is assumed
// to be the Apache Commons Lang class.
public class MergeValue implements Function2<String, String, String> {
    // MergeValue decides the jobStatus and adds up outCount and lbdCount.
    // It merges a new value (v2) into the previously collected combiner (v1) for the same key.
    @Override
    public String call(String v1, String v2) throws Exception {
        // Split on " " to match the space-separated strings that CreateCombiner returns.
        String[] vals1 = StringUtils.split(v1, " ");
        String[] vals2 = StringUtils.split(v2, " ");
        String jobStat1 = vals1[0];
        String jobStat2 = vals2[0];
        String reducedJobStat;
        boolean stat1Process = jobStat1.equals("YES") || jobStat1.equals("DEFAULT");
        boolean stat2Process = jobStat2.equals("YES") || jobStat2.equals("DEFAULT");
        if (stat1Process || stat2Process) {
            reducedJobStat = "YES";
        } else {
            reducedJobStat = "NO";
        }
        int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
        int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);
        return reducedJobStat + " " + outCount + " " + lbdCount;
    }
}
// Requires org.apache.spark.api.java.function.Function2; StringUtils is assumed
// to be the Apache Commons Lang class.
public class MergeCombiner implements Function2<String, String, String> {
    // This function combines the merged values produced by MergeValue.
    // It takes the combiners produced at the partition level and merges them
    // until one single value per key remains.
    @Override
    public String call(String v1, String v2) throws Exception {
        // Split on " " to match the space-separated strings produced by the other two functions.
        String[] vals1 = StringUtils.split(v1, " ");
        String[] vals2 = StringUtils.split(v2, " ");
        String jobStat1 = vals1[0];
        String jobStat2 = vals2[0];
        String reducedJobStat;
        // Decide the jobStatus from the 2 combiners: if either of them is "YES",
        // the merged status is "YES"; otherwise it is "NO".
        boolean stat1Process = jobStat1.equals("YES");
        boolean stat2Process = jobStat2.equals("YES");
        if (stat1Process || stat2Process) {
            reducedJobStat = "YES";
        } else {
            reducedJobStat = "NO";
        }
        int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
        int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);
        return reducedJobStat + " " + outCount + " " + lbdCount;
    }
}
Calling combineByKey:
combineByKey(new CreateCombiner(), new MergeValue(), new MergeCombiner());
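The flow over the two partitions can be traced by hand without a cluster. The following is a minimal plain-Java sketch, not the Spark implementation: `PartitionTrace` and its methods are hypothetical names, the values use the space-separated "<status> <outCount> <lbdCount>" form that the classes above return, and the third field (0) is made up because the sample rows do not show it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionTrace {
    // First value of a key in a partition: normalise YES/DEFAULT to YES, anything else to NO.
    static String createCombiner(String value) {
        String[] v = value.split(" ");
        boolean yes = v[0].equals("YES") || v[0].equals("DEFAULT");
        return (yes ? "YES" : "NO") + " " + v[1] + " " + v[2];
    }

    // Later value of the same key in the same partition: normalise it, then merge.
    static String mergeValue(String combiner, String value) {
        return mergeCombiners(combiner, createCombiner(value));
    }

    // Two already-normalised combiners: OR the statuses and add the counts.
    static String mergeCombiners(String c1, String c2) {
        String[] a = c1.split(" ");
        String[] b = c2.split(" ");
        String status = (a[0].equals("YES") || b[0].equals("YES")) ? "YES" : "NO";
        int outCount = Integer.parseInt(a[1]) + Integer.parseInt(b[1]);
        int lbdCount = Integer.parseInt(a[2]) + Integer.parseInt(b[2]);
        return status + " " + outCount + " " + lbdCount;
    }

    // Folds the rows of one partition into one combiner per key.
    static Map<String, String> foldPartition(String[][] rows) {
        Map<String, String> local = new LinkedHashMap<>();
        for (String[] r : rows) {
            String existing = local.get(r[0]);
            local.put(r[0], existing == null ? createCombiner(r[1]) : mergeValue(existing, r[1]));
        }
        return local;
    }

    static Map<String, String> run() {
        Map<String, String> p1 = foldPartition(new String[][] {
            {"key1", "YES 1 0"}, {"key2", "YES 2 0"}, {"key2", "YES 1 0"}});
        Map<String, String> p2 = foldPartition(new String[][] {
            {"key1", "YES 4 0"}, {"key3", "DEFAULT 1 0"}});
        // Merge the per-partition combiners key by key.
        Map<String, String> result = new LinkedHashMap<>(p1);
        p2.forEach((k, c) -> result.merge(k, c, PartitionTrace::mergeCombiners));
        return result;
    }

    public static void main(String[] args) {
        run().forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

In this trace key3 appears once, so only `createCombiner` touches it, and that single call is enough to turn DEFAULT into YES.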
The same logic implemented with reduceByKey:
reduceByKey(
    new Function2<String, String, String>() {
        @Override
        public String call(String s1, String s2) throws Exception {
            String[] vals1 = StringUtils.split(s1, " ");
            String[] vals2 = StringUtils.split(s2, " ");
            String jobStat1 = vals1[0];
            String jobStat2 = vals2[0];
            String reducedJobStat;
            boolean stat1Process = jobStat1.equals("YES") || jobStat1.equals("DEFAULT");
            boolean stat2Process = jobStat2.equals("YES") || jobStat2.equals("DEFAULT");
            if (stat1Process || stat2Process) {
                reducedJobStat = "YES";
            } else {
                reducedJobStat = "NO";
            }
            int outCount = Integer.parseInt(vals1[1]) + Integer.parseInt(vals2[1]);
            int lbdCount = Integer.parseInt(vals1[2]) + Integer.parseInt(vals2[2]);
            return reducedJobStat + " " + outCount + " " + lbdCount;
        }
    }
).mapValues(new Function<String, String>() {
    // Runs on every key, including keys with a single value that reduceByKey never touched.
    @Override
    public String call(String s) throws Exception {
        String[] vals = s.split(" ");
        String jobStatus = vals[0];
        if (jobStatus.equals("YES") || jobStatus.equals("DEFAULT")) {
            return "YES" + " " + vals[1] + " " + vals[2];
        } else {
            return "NO" + " " + vals[1] + " " + vals[2];
        }
    }
});