How to join two RDDs in Spark with Python?

Time: 2015-06-22 20:12:20

Tags: apache-spark join pyspark

Suppose

rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).

and I want to generate

( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).

Any simple methods? I think it is different from a cross join, but I can't find a good solution.
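As a point of comparison, a cross join in PySpark is RDD.cartesian, which pairs every element of one RDD with every element of the other and ignores keys entirely. A minimal sketch, assuming a running SparkContext sc and using string placeholders for the symbols above:

rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])
rdd1.cartesian(rdd2).count()
# 9 -- every element is paired with every element, e.g. (('a', 1), ('c', '.'))

This yields 3 x 3 = 9 pairs rather than the 4 key-matched pairs wanted above, which is why a plain cross join does not fit here.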

1 answer:

Answer 0 (score: 12):

You are just looking for a simple join, e.g.

rdd = sc.parallelize([("red", 20), ("red", 30), ("blue", 100)])
rdd2 = sc.parallelize([("red", 40), ("red", 50), ("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
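Applied to the data from the question, the same call reproduces the desired result. A sketch, again assuming a live SparkContext sc and string placeholders for the symbols:

rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 1)])
rdd2 = sc.parallelize([("a", "?"), ("a", "*"), ("c", ".")])
rdd1.join(rdd2).collect()
# [('a', (1, '?')), ('a', (1, '*')), ('a', (2, '?')), ('a', (2, '*'))] (order may vary)

join is an inner join on keys, so the unmatched keys 'b' and 'c' are dropped, exactly as in the desired output.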