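I am using Spark 1.3.1, where joining two DataFrames repeats the column(s) being joined. I left outer join two DataFrames and want to send the resulting DataFrame to the na().fill() method to convert nulls to known values based on each column's data type. I built a map of "table.column" -> "value" and passed it to the fill method, but I get an exception instead of success :(.
What are my options? I see there is a dataFrame.withColumnRenamed method, but it renames only one column at a time, and my joins involve more than one column. Do I just need to make sure the DataFrame I apply na().fill() to has a unique set of column names, regardless of table aliases?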
Given:
scala> val df1 = sqlContext.jsonFile("people.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [first: string, last: string]
scala> val df2 = sqlContext.jsonFile("people.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [first: string, last: string]
I can join them with
scala> val df3 = df1.join(df2, df1("first") === df2("first"), "left_outer")
I have a map from column names to the values that should replace nulls.
scala> val map = Map("df1.first"->"unknown", "df1.last" -> "unknown",
"df2.first" -> "unknown", "df2.last" -> "unknown")
But calling fill(map) throws an exception.
scala> df3.na.fill(map)
org.apache.spark.sql.AnalysisException: Reference 'first' is ambiguous,
could be: first#6, first#8.;
Answer 0 (score: 3)
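Here is what I came up with. In my original example there was nothing interesting left in df2 after the join, so I changed it to the classic department/employee example.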
department.json
{"department": 2, "name":"accounting"}
{"department": 1, "name":"engineering"}
person.json
{"department": 1, "first":"Bruce", "last": "szalwinski"}
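Now I can join the DataFrames, build the map, and replace the nulls with "unknown". Selecting the columns through their aliases first keeps only one department column, so the names passed to fill are no longer ambiguous.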
scala> val df1 = sqlContext.jsonFile("department.json").as("df1")
df1: org.apache.spark.sql.DataFrame = [department: bigint, name: string]
scala> val df2 = sqlContext.jsonFile("person.json").as("df2")
df2: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> val df3 = df1.join(df2, df1("department") === df2("department"), "left_outer")
df3: org.apache.spark.sql.DataFrame = [department: bigint, name: string, department: bigint, first: string, last: string]
scala> val map = Map("first" -> "unknown", "last" -> "unknown")
map: scala.collection.immutable.Map[String,String] = Map(first -> unknown, last -> unknown)
scala> val df4 = df3.select("df1.department", "df2.first", "df2.last").na.fill(map)
df4: org.apache.spark.sql.DataFrame = [department: bigint, first: string, last: string]
scala> df4.show()
+----------+-------+----------+
|department|  first|      last|
+----------+-------+----------+
|         2|unknown|   unknown|
|         1|  Bruce|szalwinski|
+----------+-------+----------+
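
If you would rather keep every column from both sides, another option is to make the names unique before the join and then derive the fill map from the schema. The following is only a rough sketch, not something I have run against every Spark version: it assumes a spark-shell session where sc and sqlContext already exist, and prefixColumns, defaultFor, defaultsByType, the "dept"/"person" prefixes, and the default values ("unknown", 0L) are all made-up names and choices for illustration. It folds withColumnRenamed over df.columns, which gets around the one-column-at-a-time limitation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, LongType, StringType}

// Hypothetical helper: prefix every column so names stay unique after a join.
def prefixColumns(df: DataFrame, prefix: String): DataFrame =
  df.columns.foldLeft(df) { (acc, name) =>
    acc.withColumnRenamed(name, prefix + "_" + name)
  }

// Hypothetical helper: pick a replacement value from the column's data type.
// The defaults are illustrative assumptions; types not listed are left alone.
def defaultFor(dataType: DataType): Option[Any] = dataType match {
  case StringType => Some("unknown")
  case LongType   => Some(0L)
  case _          => None
}

// Hypothetical helper: build the column -> value map that na.fill expects.
def defaultsByType(df: DataFrame): Map[String, Any] =
  (for {
    field <- df.schema.fields.toSeq
    value <- defaultFor(field.dataType)
  } yield field.name -> value).toMap

// Same sample rows as the JSON files above, built in memory.
val departments = sqlContext.jsonRDD(sc.parallelize(Seq(
  """{"department": 2, "name": "accounting"}""",
  """{"department": 1, "name": "engineering"}""")))
val people = sqlContext.jsonRDD(sc.parallelize(Seq(
  """{"department": 1, "first": "Bruce", "last": "szalwinski"}""")))

val dept = prefixColumns(departments, "dept")
val person = prefixColumns(people, "person")

// Every column name in the joined DataFrame is now unique.
val joined = dept.join(person,
  dept("dept_department") === person("person_department"), "left_outer")

// Fill nulls using defaults chosen per data type.
val filled = joined.na.fill(defaultsByType(joined))
```

Calling filled.show() should then list all five prefixed columns. The trade-off is longer column names, but nothing has to be dropped from the join result to make fill work.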