I have two DataFrames.
countryDF
+-------+-------------------+--------+---------+
| id | CountryName |Latitude|Longitude|
+-------+-------------------+--------+---------+
| 1 | United States | 39.76 | -98.5 |
| 2 | China | 35 | 105 |
| 3 | India | 20 | 77 |
| 4 | Brazil | -10 | -55 |
...
+-------+-------------------+--------+---------+
salesDF
+-------+-------------------+--------+---------+--------+
| id | Country |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+--------+
| 1 | Japan | | | 11 |
| 2 | China | | | 12 |
| 3 | Brazil | | | 56 |
| 4 | Scotland | | | 12 |
...
+-------+-------------------+--------+---------+--------+
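The task is to fill in Latitude and Longitude for salesDF. Each value in the salesDF column "Country" should be looked up in the countryDF column "CountryName"; if a matching row is found, its "Latitude" and "Longitude" are attached to the sales row.
The output DataFrame should be: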
+-------+-------------------+--------+---------+---------+
| id | CountryName |Latitude|Longitude|revenue |
+-------+-------------------+--------+---------+---------+
| 1 | Japan | 35.6 | 139 | 11 |
| 2 | China | 35 | 105 | 12 |
| 3 | Brazil | -10 | -55 | 56 |
| 4 | Scotland | 55.95 | -3.18 | 12 |
...
+-------+-------------------+--------+---------+---------+
I wrote a map function to do this, but it seems the function passed to map cannot access the outer DataFrame variable. Is there a solution?
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

val countryDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("Country.csv")
var revenueDF = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("revenue.csv")
var resultRdd = revenueDF.rdd.map(row => {
  // Copies the row and overwrites the latitude/longitude cells
  val generateRow = (row: Row, latitude: Any, longitude: Any, latitudeIndex: Int, longitudeIndex: Int) => {
    val arr = row.toSeq.toArray
    arr(latitudeIndex) = latitude
    arr(longitudeIndex) = longitude
    Row.fromSeq(arr)
  }
  val countryName = row.getAs[String](1)
  // cannot access countryDF here, it is corrupted
  val countryRow = countryDF.where(col("CountryName") === countryName)
  generateRow(row, row.getAs[String](2), row.getAs[String](3), 2, 3)
})
revenueDF.sqlContext.createDataFrame(resultRdd, revenueDF.schema).show()
Answer 0 (score: 0)
The operation you are looking for is join:
// "revenue" is selected as well so that it appears in the output
salesDF.select("id", "Country", "revenue").join(
  countryDF.select("CountryName", "Latitude", "Longitude"),
  $"CountryName" === $"Country",
  "left"
).drop("Country")
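As a self-contained sketch of the join above, the snippet below builds two small in-memory tables and runs the left join in a local session. The toy rows are copied from the question, except that Japan is deliberately left out of countryDF to show what happens to unmatched rows. The local SparkSession setup and the app name are assumptions for experimentation. This variant drops "CountryName" rather than "Country", so unmatched sales rows keep their country name.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("join-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

// Toy stand-in for countryDF (Japan intentionally missing).
val countryDF = Seq(
  (1, "United States", 39.76, -98.5),
  (2, "China", 35.0, 105.0),
  (4, "Brazil", -10.0, -55.0)
).toDF("id", "CountryName", "Latitude", "Longitude")

// Toy stand-in for salesDF, with empty Latitude/Longitude as in the question.
val salesDF = Seq(
  (1, "Japan", Option.empty[Double], Option.empty[Double], 11),
  (2, "China", Option.empty[Double], Option.empty[Double], 12),
  (3, "Brazil", Option.empty[Double], Option.empty[Double], 56)
).toDF("id", "Country", "Latitude", "Longitude", "revenue")

// Drop the empty coordinate columns first so the joined names are unambiguous,
// then left-join: every sales row is kept, unmatched countries get nulls.
val joined = salesDF.select("id", "Country", "revenue").join(
  countryDF.select("CountryName", "Latitude", "Longitude"),
  $"Country" === $"CountryName",
  "left"
).drop("CountryName")

joined.show()
```

With a left join, a country that has no match in countryDF (Japan in this toy data) stays in the result with null Latitude and Longitude; an inner join would drop that row instead.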
And no, you cannot use DataFrames, RDDs, or other distributed objects inside map, udf, or equivalent.
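If a per-row lookup inside map is still wanted, one common workaround (not part of the answer above, and only reasonable when the country table is small enough to fit in driver memory) is to collect it into an ordinary Scala Map and broadcast that. A minimal sketch, assuming toy data and a local session; the names coords, coordsBc and enriched are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("broadcast-lookup-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

val countryDF = Seq(
  ("China", 35.0, 105.0),
  ("Brazil", -10.0, -55.0)
).toDF("CountryName", "Latitude", "Longitude")

val salesDF = Seq(
  (2, "China", 12),
  (3, "Brazil", 56),
  (4, "Scotland", 12)
).toDF("id", "Country", "revenue")

// Collect the small table to the driver as a plain (non-distributed) Map.
val coords: Map[String, (Double, Double)] = countryDF
  .as[(String, Double, Double)]
  .collect()
  .map { case (name, lat, lon) => name -> (lat, lon) }
  .toMap

// Broadcast it so every executor receives a read-only local copy.
val coordsBc = spark.sparkContext.broadcast(coords)

// Inside map only the broadcast Map is touched, never a DataFrame.
val enriched = salesDF
  .as[(Int, String, Int)]
  .map { case (id, country, revenue) =>
    val hit = coordsBc.value.get(country)
    (id, country, hit.map(_._1), hit.map(_._2), revenue)
  }
  .toDF("id", "Country", "Latitude", "Longitude", "revenue")

enriched.show()
```

This works because the closure captures a plain local value rather than a distributed object. For anything larger than a small lookup table, the join is usually the better choice.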