How can I remove spaces from the column names of a Spark DataFrame using Scala?
For example, I have the column names "Type", "Device ID" and "Office Address". I need to get "Type", "DeviceID" and "OfficeAddress".
Answer 0 (score: 1)
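You can use either the selectExpr method or the withColumn method, as the complete example below shows.
With selectExpr, column names that contain spaces must be wrapped in backticks, like this:
"`Device ID` as DeviceId", "`Office Address` as OfficeAddress"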
// Assumes a SparkSession named spark is in scope (as in spark-shell);
// the import enables toDF on a Seq and the $"..." column syntax.
import spark.implicits._

println("selectExpr approach")
val basedf = Seq(
  (1, "100abcd", "8100 Memorial Ln Plano Texas"),
  (0, "100abcd1", "8100 Memorial Ln Plano Texas"),
  (0, "100abcd2", "8100 Memorial Ln Plano Texas"),
  (1, "100abcd2", "8100 Memorial Ln Plano Texas"),
  (1, "100abcd2", "8100 Memorial Ln Plano Texas")
).toDF("Type", "Device ID", "Office Address")
basedf.show(false)
// Backticks quote the column names that contain spaces.
basedf.selectExpr("Type as type", "`Device ID` as DeviceId", "`Office Address` as OfficeAddress").show(false)

// Second example: copy each column under its new name, then drop the originals.
println("with column approach")
val df1 = basedf
  .withColumn("DeviceID", $"Device ID")
  .withColumn("OfficeAddress", $"Office Address")
  .drop("Device ID", "Office Address")
df1.show(false)
Result:
selectExpr approach
+----+---------+----------------------------+
|Type|Device ID|Office Address |
+----+---------+----------------------------+
|1 |100abcd |8100 Memorial Ln Plano Texas|
|0 |100abcd1 |8100 Memorial Ln Plano Texas|
|0 |100abcd2 |8100 Memorial Ln Plano Texas|
|1 |100abcd2 |8100 Memorial Ln Plano Texas|
|1 |100abcd2 |8100 Memorial Ln Plano Texas|
+----+---------+----------------------------+
+----+--------+----------------------------+
|type|DeviceId|OfficeAddress |
+----+--------+----------------------------+
|1 |100abcd |8100 Memorial Ln Plano Texas|
|0 |100abcd1|8100 Memorial Ln Plano Texas|
|0 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
+----+--------+----------------------------+
with column approach
+----+--------+----------------------------+
|Type|DeviceID|OfficeAddress |
+----+--------+----------------------------+
|1 |100abcd |8100 Memorial Ln Plano Texas|
|0 |100abcd1|8100 Memorial Ln Plano Texas|
|0 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
+----+--------+----------------------------+
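The selectExpr call above spells out each rename by hand. As a rough sketch only, the same kind of expression could be generated from the column list instead. `SelectExprRenames` and `renameExprs` are hypothetical names introduced here for illustration, not part of Spark, and the sketch assumes that no column name itself contains a backtick.

```scala
object SelectExprRenames {
  // Builds one "`old name` as newName" expression per column,
  // where newName is the old name with every space removed.
  // Assumption: column names contain no backticks of their own.
  def renameExprs(columns: Seq[String]): Seq[String] =
    columns.map { c =>
      val cleaned = c.replaceAll(" ", "")
      s"`$c` as $cleaned"
    }
}
```

With the DataFrame from the example, the result could be passed on as `basedf.selectExpr(SelectExprRenames.renameExprs(basedf.columns.toSeq): _*)`. Unlike the hand-written call, this keeps the original casing ("Type", "DeviceID").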
A generic approach, which works no matter which column names contain spaces, is shown below:
import org.apache.spark.sql.DataFrame

println("Generic column rename approach for n number of Columns")
basedf.printSchema()
// Rename every column in turn, removing all spaces from its name.
var newDf: DataFrame = basedf
newDf.columns.foreach { col =>
  println(col + " after column replace " + col.replaceAll(" ", ""))
  newDf = newDf.withColumnRenamed(col, col.replaceAll(" ", ""))
}
newDf.printSchema()
newDf.show(false)
Result:
Generic column rename approach for n number of Columns
root
|-- Type: integer (nullable = false)
|-- Device ID: string (nullable = true)
|-- Office Address: string (nullable = true)
Type after column replace Type
Device ID after column replace DeviceID
Office Address after column replace OfficeAddress
root
|-- Type: integer (nullable = false)
|-- DeviceID: string (nullable = true)
|-- OfficeAddress: string (nullable = true)
+----+--------+----------------------------+
|Type|DeviceID|OfficeAddress |
+----+--------+----------------------------+
|1 |100abcd |8100 Memorial Ln Plano Texas|
|0 |100abcd1|8100 Memorial Ln Plano Texas|
|0 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
|1 |100abcd2|8100 Memorial Ln Plano Texas|
+----+--------+----------------------------+
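The generic loop above reassigns a `var`. A possible variation, sketched here under the assumption of a local SparkSession, folds over the column names instead and also shows the single-call form using `toDF`. `RenameWithoutVar` and `stripSpaces` are hypothetical names chosen for this sketch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameWithoutVar {
  // Removes every space from every column name,
  // applying one withColumnRenamed per column.
  def stripSpaces(df: DataFrame): DataFrame =
    df.columns.foldLeft(df) { (acc, name) =>
      acc.withColumnRenamed(name, name.replaceAll(" ", ""))
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is sufficient for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-columns")
      .getOrCreate()
    import spark.implicits._

    val basedf = Seq((1, "100abcd", "8100 Memorial Ln Plano Texas"))
      .toDF("Type", "Device ID", "Office Address")

    stripSpaces(basedf).printSchema()

    // Alternative: pass the complete list of new names to toDF in one call.
    val cleanedNames = basedf.columns.map(_.replaceAll(" ", "")).toSeq
    basedf.toDF(cleanedNames: _*).printSchema()

    spark.stop()
  }
}
```

Both forms rename purely by string replacement, so two columns that differ only in spaces (for example "Device ID" and "DeviceID") would end up with the same name. It may be worth checking for such collisions before renaming.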
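Conclusion:
Of these three approaches, I prefer the generic one: if you have a large number of columns, it handles the renaming effectively without typing…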