I have two RDDs. The first holds full records whose first field is an ID, and the second holds only IDs (see rdd1 and rdd2 in the answer below).
How can I select the rows of the first RDD that are not contained in the second?
Answer 0 (score: 1)
One approach is to convert the RDDs to DataFrames and apply a left_anti join, as follows:
// Outside spark-shell, toDF requires `import spark.implicits._` (spark being the SparkSession)
val rdd1 = sc.parallelize(Seq(
("39250E6B-DB60-496E-8770-225EDB29A85F",0,"ar","2018-10-09 00:00:00.0","2018-10-09 00:00:04.0","United Arab Emirates"),
("2b98d4f4-0c55-4906-82ec-cfc7c5380652",40967837,"en","2018-10-09 00:00:01.0","2018-10-09 00:00:31.0","Qatar"),
("5bb587bc-54a0-4873-b0ba-2da38458ba1c",0,"en","2018-10-09 00:00:03.0","2018-10-09 00:04:02.0","United Arab Emirates"),
("466B96B5-DC12-4A35-8865-3A8702037A23",0,"ar","2018-10-09 00:00:04.0","2018-10-09 00:00:06.0","Saudi Arabia")
))
val rdd2 = sc.parallelize(Seq(
("39250E6B-DB60-496E-8770-225EDB29A85F"),
("2b98d4f4-0c55-4906-82ec-cfc7c5380652")
))
val dfResult = rdd1.toDF.join(rdd2.toDF("_1"), Seq("_1"), "left_anti")
dfResult.show
// +--------------------+---+---+--------------------+--------------------+--------------------+
// | _1| _2| _3| _4| _5| _6|
// +--------------------+---+---+--------------------+--------------------+--------------------+
// |5bb587bc-54a0-487...| 0| en|2018-10-09 00:00:...|2018-10-09 00:04:...|United Arab Emirates|
// |466B96B5-DC12-4A3...| 0| ar|2018-10-09 00:00:...|2018-10-09 00:00:...| Saudi Arabia|
// +--------------------+---+---+--------------------+--------------------+--------------------+
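If you want the resulting dataset as an RDD, simply apply rdd to the joined DataFrame:
val rddResult = dfResult.rdd
Another approach is to convert the RDDs to PairRDDs and apply leftOuterJoin, then filter out any rows that have a common key:
val rddKV1 = rdd1.map(r => (r._1, (r._2, r._3, r._4, r._5, r._6)))
val rddKV2 = rdd2.map(r => (r, 1))
val rddResult = rddKV1.leftOuterJoin(rddKV2).
  filter(r => r._2._2.isEmpty).
  map{ case (k, (v, _)) => (k, v._1, v._2, v._3, v._4, v._5) }
[UPDATE]
As @Archer's comment points out, subtractByKey seems to be the most straightforward solution when taking the PairRDD approach.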
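The code for the subtractByKey variant did not survive in this copy of the answer, so what follows is only a minimal sketch of how it might look, not the original author's snippet. It assumes a local SparkSession and uses shortened, hypothetical stand-in rows (three fields instead of six, made-up IDs such as "id-1"); the object name SubtractByKeySketch is likewise just for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SubtractByKeySketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `sc` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("subtractByKey-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical, shortened stand-ins for rdd1 (records) and rdd2 (IDs only).
    val rdd1 = sc.parallelize(Seq(
      ("id-1", 0, "ar"),
      ("id-2", 40967837, "en"),
      ("id-3", 0, "en")
    ))
    val rdd2 = sc.parallelize(Seq("id-1", "id-2"))

    // Key both RDDs by ID; the value paired with each rdd2 key is a throwaway.
    val rddKV1 = rdd1.map(r => (r._1, (r._2, r._3)))
    val rddKV2 = rdd2.map(r => (r, 1))

    // Keep only the pairs of rddKV1 whose key does not occur in rddKV2,
    // then flatten each pair back into the original tuple shape.
    val rddResult = rddKV1.subtractByKey(rddKV2)
      .map { case (k, v) => (k, v._1, v._2) }

    rddResult.collect().foreach(println) // prints (id-3,0,en)

    spark.stop()
  }
}
```

Compared with the leftOuterJoin version, this needs no filter on a missing right-hand value, because subtractByKey drops the matching keys directly.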