How can I access an external DataFrame inside an RDD map function?

Asked: 2019-03-13 09:38:51

Tags: scala apache-spark dataframe rdd

I have two DataFrames.

countryDF

+-------+-------------------+--------+---------+
|   id  |    CountryName    |Latitude|Longitude|
+-------+-------------------+--------+---------+
|  1    | United States     |  39.76 |   -98.5 |
|  2    | China             |  35    |   105   |
|  3    | India             |  20    |   77    |
|  4    | Brazil            |  -10   |   -55   |
...
+-------+-------------------+--------+---------+

salesDF

+-------+-------------------+--------+---------+--------+
|   id  |    Country        |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+--------+
|  1    | Japan             |        |         |   11   |
|  2    | China             |        |         |   12   |
|  3    | Brazil            |        |         |   56   |
|  4    | Scotland          |        |         |   12   |
...
+-------+-------------------+--------+---------+--------+

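The task is to fill in the latitude and longitude for salesDF. For each value in the salesDF column "Country", look it up in the countryDF column "CountryName". If a matching row is found, append the corresponding "Latitude" and "Longitude".

The output DataFrame should be: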

+-------+-------------------+--------+---------+---------+
|   id  |    CountryName    |Latitude|Longitude|revenue  |
+-------+-------------------+--------+---------+---------+
|  1    | Japan             |  35.6  |   139   | 11      |
|  2    | China             |  35    |   105   | 12      |
|  3    | Brazil            |  -10   |   -55   | 56      |
|  4    | Scotland          |  55.95 |  -3.18  | 12      |
...
+-------+-------------------+--------+---------+---------+

I wrote a map function to do this, but it seems the map function cannot access the outer DataFrame variable. Is there a solution?

val countryDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("Country.csv")

var revenueDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("revenue.csv")

var resultRdd = revenueDF.rdd.map(row => {
  val generateRow = (row: Row, latitude: Any, longitude: Any, latitudeIndex: Int, longitudeIndex: Int) => {
    val arr = row.toSeq.toArray
    arr(latitudeIndex) = latitude
    arr(longitudeIndex) = longitude
    Row.fromSeq(arr)
  }
  val countryName = row.getAs[String](1)
  // cannot access countryDF, it is corrupted
  val countryRow = countryDF.where(col("CountryName") === countryName)
  generateRow(row, row.getAs[String](2), row.getAs[String](3),2, 3)

})
revenueDF.sqlContext.createDataFrame(resultRdd, revenueDF.schema).show()

1 Answer:

Answer 0 (score: 0)

The operation you are looking for is a join:

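// The $"..." column syntax requires: import spark.implicits._
// "revenue" is selected too, so that it survives into the output.
salesDF.select("id", "Country", "revenue").join(
  countryDF.select("CountryName", "Latitude", "Longitude"),
  $"CountryName" === $"Country",
  "left"
).drop("Country")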
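To make the join easier to try out, here is a minimal self-contained sketch using a few of the sample rows from the question. The object name `JoinSketch`, the helper `attachCoordinates`, and the local `SparkSession` settings are assumptions made for illustration, not part of the original code. Unlike the snippet above, it keeps the sales-side `Country` column and drops `CountryName`, so a sale whose country is missing from countryDF still shows its name (with null coordinates).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinSketch {
  // Hypothetical helper: left-join sales rows to country coordinates by name.
  def attachCoordinates(sales: DataFrame, countries: DataFrame): DataFrame =
    sales
      .select("id", "Country", "revenue")
      .join(
        countries.select("CountryName", "Latitude", "Longitude"),
        col("Country") === col("CountryName"),
        "left" // keep every sales row, even without a match
      )
      .drop("CountryName")
      .select("id", "Country", "Latitude", "Longitude", "revenue")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    import spark.implicits._

    val countryDF = Seq(
      (1, "United States", 39.76, -98.5),
      (2, "China", 35.0, 105.0),
      (4, "Brazil", -10.0, -55.0)
    ).toDF("id", "CountryName", "Latitude", "Longitude")

    val salesDF = Seq(
      (1, "Japan", 11),
      (2, "China", 12),
      (3, "Brazil", 56)
    ).toDF("id", "Country", "revenue")

    // Japan is not in this toy countryDF, so its coordinates come back null.
    attachCoordinates(salesDF, countryDF).show()

    spark.stop()
  }
}
```

If countryDF is known to be small, wrapping it in `org.apache.spark.sql.functions.broadcast(...)` before the join is a common hint to avoid shuffling the larger table, though whether it helps depends on the data sizes.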

And no, you cannot use DataFrames, RDDs, or other distributed objects inside map, udf, or equivalent operations.
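If a map-style transformation is still preferred, one common workaround is to turn the small table into a plain local value first, since ordinary serializable objects can be used inside closures. The sketch below collects countryDF into a `Map` on the driver and ships it to the executors as a broadcast variable. It assumes countryDF fits comfortably in driver memory, that its coordinate columns contain no nulls, and that `id` and `revenue` are integers; the names `BroadcastLookupSketch` and `fillCoordinates` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object BroadcastLookupSketch {
  // Hypothetical helper: look up coordinates from a broadcast local Map.
  def fillCoordinates(spark: SparkSession, sales: DataFrame, countries: DataFrame): DataFrame = {
    import spark.implicits._

    // Collect the (assumed small) country table to the driver as a plain Map.
    val lookup: Map[String, (Double, Double)] = countries
      .select(
        col("CountryName"),
        col("Latitude").cast("double"),
        col("Longitude").cast("double")
      )
      .as[(String, Double, Double)]
      .collect()
      .map { case (name, lat, lon) => name -> ((lat, lon)) }
      .toMap

    // A broadcast variable is a serializable handle, so closures may capture it.
    val lookupBc = spark.sparkContext.broadcast(lookup)

    sales
      .select("id", "Country", "revenue")
      .as[(Int, String, Int)]
      .map { case (id, country, revenue) =>
        val coords = lookupBc.value.get(country) // None if the country is unknown
        (id, country, coords.map(_._1), coords.map(_._2), revenue)
      }
      .toDF("id", "Country", "Latitude", "Longitude", "revenue")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-lookup-sketch")
      .getOrCreate()
    import spark.implicits._

    val countryDF = Seq(
      ("China", 35.0, 105.0),
      ("Brazil", -10.0, -55.0)
    ).toDF("CountryName", "Latitude", "Longitude")

    val salesDF = Seq(
      (1, "Japan", 11),
      (2, "China", 12),
      (3, "Brazil", 56)
    ).toDF("id", "Country", "revenue")

    fillCoordinates(spark, salesDF, countryDF).show()
    spark.stop()
  }
}
```

For most cases the join remains the simpler choice; the broadcast lookup mainly makes sense when the per-row logic is awkward to express with DataFrame operations.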