After running the code
val df = spark.sql(sql_query)
df.show()
I get a DataFrame like this:
// +--------+-------+
// | id_card| year|
// +--------+-------+
// |001_1976| 2017 |
// |015_1983| 2012 |
// |078_1963| 2011 |
// +--------+-------+
Now I want a new column named "work_year", computed as
(year - id_card.substring(4,8)).
I have read the source code of withColumn() and noticed that its column
argument must be an org.apache.spark.sql.Column rather than a plain
String, which puzzles me.
spark version: Spark 2.1.0
scala version: 2.12.1
jdk version: 1.8
Answer 0 (score: 1)
You can do this with the withColumn function on the DataFrame df together with a udf.
import org.apache.spark.sql.functions.udf
val df = sc.parallelize(Seq(("001_1976", 2017), ("015_1983", 2012), ("078_1963", 2011))).toDF("c1", "c2")
val work_year = udf((x: String) => x.substring(4,8))
scala> df.withColumn("work_year", work_year($"c1")).show()
+--------+----+---------+
| c1| c2|work_year|
+--------+----+---------+
|001_1976|2017| 1976|
|015_1983|2012| 1983|
|078_1963|2011| 1963|
+--------+----+---------+
Or using spark-sql, as shown below:
df.registerTempTable("temp_table") // deprecated since Spark 2.0; prefer df.createOrReplaceTempView("temp_table")
scala> spark.sql("SELECT c1,c2, substring(c1,5,8) from temp_table").show()
+--------+----+-------------------+
| c1| c2|substring(c1, 5, 8)|
+--------+----+-------------------+
|001_1976|2017| 1976|
|015_1983|2012| 1983|
|078_1963|2011| 1963|
+--------+----+-------------------+
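One detail worth noting about the two snippets above: Scala's String.substring and Spark SQL's substring use different indexing conventions, which is why the udf passes (4, 8) while the SQL query passes (5, 8). A minimal plain-Scala sketch of the difference:

```scala
object SubstringDemo {
  def main(args: Array[String]): Unit = {
    val id = "001_1976"
    // Scala's String.substring(begin, end) is 0-based with an
    // exclusive end index, so (4, 8) selects characters 4..7:
    println(id.substring(4, 8)) // prints 1976
    // Spark SQL's substring(str, pos, len) is 1-based and takes a
    // length, so substring(c1, 5, 8) starts at the 5th character and,
    // since only four characters remain, likewise yields "1976".
  }
}
```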
Answer 1 (score: 1)
Expanding on @rogue-one's answer: the OP asked for
work_year = (year - id_card.substring(4,8)),
so the udf should be:
val work_year = udf((x: String, y: Int) => y - x.substring(4,8).toInt)
df.withColumn("work_year", work_year($"id_card", $"year")).show()
Output:
+--------+----+---------+
| id_card|year|work_year|
+--------+----+---------+
|001_1976|2017| 41|
|015_1983|2012| 29|
|078_1963|2011| 48|
+--------+----+---------+
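Since the udf's body is plain Scala, its arithmetic can be sanity-checked without a Spark session. A minimal sketch (the helper name workYear is my own, not from the answer):

```scala
object WorkYearDemo {
  // Same logic as the udf above: take the 4-digit birth year at
  // 0-based positions 4..7 and subtract it from the given year.
  def workYear(idCard: String, year: Int): Int =
    year - idCard.substring(4, 8).toInt

  def main(args: Array[String]): Unit = {
    println(workYear("001_1976", 2017)) // prints 41
    println(workYear("078_1963", 2011)) // prints 48
  }
}
```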