How do I compare two RDDs in Scala?

Time: 2016-09-21 19:23:46

Tags: java scala apache-spark rdd

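I have two RDDs of type RDD[(String, String, DateTime, Int, Array[Byte])]; let's call them Rdd1 and Rdd2. I want to compare every tuple value between Rdd1 and Rdd2 and store the tuples that do not match in a new RDD. I am using Spark with Scala.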

For example, suppose the contents of the RDDs are:

 data = ((student1,XII,2016-09-11T00:00:00.000Z,1,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120001), 
         (student1,XII,2016-09-11T00:00:00.000Z,2,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120002),
         (student2,XII,2016-09-12T00:00:00.000Z,2,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120004),
         (student3,XII,2016-09-13T00:00:00.000Z,4,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120005)) 
 data2 = ((student1,XII,2016-09-11T00:00:00.000Z,1,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120001),
          (student1,XII,2016-09-11T00:00:00.000Z,2,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120002),
          (student2,XII,2016-09-12T00:00:00.000Z,2,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120004))

and consider the partition key to be

case class Student(name: String, `class`: String, dob: DateTime)

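I want to check whether every entry in rdd1 (with all of its field values) also exists in rdd2 and, if it does not, store it in a new RDD. In the example above, the output would be:

 val resultRdd = ((student3,XII,2016-09-13T00:00:00.000Z,4,0x0a130a0942553030303730333510ba1118f92120000a130a0942553030303730333510ba1118f92120005))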
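
One way to express the comparison described above is sketched here. It is a minimal sketch, not a confirmed solution: the names `RddDiff`, `Record`, `comparable`, and `missingFrom` are hypothetical, `java.time.Instant` stands in for the question's `DateTime` so the example needs no extra date library, and the sample data is shortened. It relies on Spark's `keyBy` and `subtractByKey`. The byte array is converted to a `Seq[Byte]` first, because `Array[Byte]` is compared by reference on the JVM, so two arrays with identical contents would otherwise never be treated as equal.

```scala
import java.time.Instant
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddDiff {
  // Record shape from the question; Instant replaces DateTime (assumption).
  type Record = (String, String, Instant, Int, Array[Byte])

  // Array[Byte] has reference equality, so turn the payload into a Seq[Byte],
  // which has value-based equals and hashCode.
  def comparable(r: Record): (String, String, Instant, Int, Seq[Byte]) =
    (r._1, r._2, r._3, r._4, r._5.toSeq)

  // Records of `left` for which `right` holds no record with identical field values.
  def missingFrom(left: RDD[Record], right: RDD[Record]): RDD[Record] =
    left.keyBy(comparable).subtractByKey(right.keyBy(comparable)).values

  def main(args: Array[String]): Unit = {
    // Local master is used only to make the sketch runnable on one machine.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-diff").setMaster("local[*]"))
    try {
      val day = Instant.parse("2016-09-11T00:00:00.000Z")
      val rdd1 = sc.parallelize(Seq[Record](
        ("student1", "XII", day, 1, Array[Byte](1)),
        ("student3", "XII", Instant.parse("2016-09-13T00:00:00.000Z"), 4, Array[Byte](5))))
      val rdd2 = sc.parallelize(Seq[Record](
        ("student1", "XII", day, 1, Array[Byte](1))))
      // Print the records that are in rdd1 but not in rdd2.
      missingFrom(rdd1, rdd2).collect().foreach(r => println(comparable(r)))
    } finally sc.stop()
  }
}
```

Note that `subtractByKey` drops every record of the left RDD whose key appears in the right RDD, regardless of how many times it occurs, so this sketch does not detect differences in duplicate counts.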

0 answers:

No answers