I want to remove constant columns from a CSV file. I wrote the code below, but it throws an exception when a column's data type is Timestamp, i.e. when the column is not numeric. Is there any way to generalize this code so that it works for all input types? Thanks.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.stddev

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true).csv(strPath)
df.printSchema()

// Compute the standard deviation of every column; this throws for
// non-numeric columns such as timestamps.
val aggregations = df.columns.map(c => stddev(s"`${c}`").as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
df2.printSchema()

// Keep only the columns whose standard deviation is non-zero.
val columnsToKeep: Seq[String] = (df2.first match {
  case r: Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
  .filter(_._1 != 0) // your special condition is in the filter
  .map(_._2)         // keep just the name of the column

val column = columnsToKeep.map(f => s"`${f}`")
val finalResult = df.select(column.head, column.tail: _*)
finalResult.printSchema()
Here is a data sample; the timestamp column is the one causing the problem:
time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,3
2015-12-08 12:43:22,4,5,3
2015-12-09 12:47:55,5,7,3
The expected result is:
time.1,col.1,col.2
2015-12-06 12:40:00,2,2
2015-12-07 12:41:35,3,3
2015-12-08 12:43:22,4,5
2015-12-09 12:47:55,5,7
because col.3 never changes.
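For reference, a minimal sketch of one way to generalize the stddev approach, under the assumption that non-numeric columns (such as the timestamp) should always be kept and only numeric columns are tested for constancy:

import org.apache.spark.sql.types.NumericType
import org.apache.spark.sql.functions.stddev

// Split the schema into numeric and non-numeric columns.
val (numericCols, otherCols) = df.schema.fields.partition(_.dataType.isInstanceOf[NumericType])

// stddev is only defined for numeric columns, so aggregate over those alone.
val aggs = numericCols.map(f => stddev(s"`${f.name}`").as(f.name))
val stats = df.agg(aggs.head, aggs.tail: _*).first()

// Keep every non-numeric column, plus the numeric columns whose stddev is non-zero.
// The aggregation result preserves the order of numericCols, so positional
// indexing is safe here.
val keep = otherCols.map(f => s"`${f.name}`") ++
  numericCols.zipWithIndex.collect { case (f, i) if stats.getDouble(i) != 0 => s"`${f.name}`" }

df.select(keep.head, keep.tail: _*).printSchema()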
Answer 0 (score: 1)
To drop the constant columns, you can use countDistinct:
import org.apache.spark.sql.functions.countDistinct

val to_keep = df.columns.collect{ case col if df.agg(countDistinct(s"`${col}`")).first().getAs[Long](0) > 1 => s"`${col}`" }
// to_keep: Array[String] = Array(`time.1`, `col.1`, `col.2`)
df.select(to_keep.head, to_keep.tail: _*).show
+--------------------+-----+-----+
| time.1|col.1|col.2|
+--------------------+-----+-----+
|2015-12-06 12:40:...| 2| 2|
|2015-12-07 12:41:...| 3| 3|
|2015-12-08 12:43:...| 4| 5|
|2015-12-09 12:47:...| 5| 7|
+--------------------+-----+-----+
Note that the version above runs a separate aggregation job for each column. To compute all the distinct counts in a single pass, build the aggregations first:
val aggs = df.columns.map(col => countDistinct(s"`${col}`").as(col))
// aggs: Array[org.apache.spark.sql.Column] = Array(count(DISTINCT `time.1`) AS `time.1`, count(DISTINCT `col.1`) AS `col.1`, count(DISTINCT `col.2`) AS `col.2`, count(DISTINCT `col.3`) AS `col.3`)
val count_map = df.agg(aggs.head, aggs.tail: _*).first().getValuesMap[Long](df.columns)
// count_map: Map[String,Long] = Map(time.1 -> 4, col.1 -> 4, col.2 -> 4, col.3 -> 1)
val to_keep = df.columns.collect{case col if count_map(col) > 1 => s"`${col}`"}
// to_keep: Array[String] = Array(`time.1`, `col.1`, `col.2`)
df.select(to_keep.head, to_keep.tail: _*).show
+--------------------+-----+-----+
| time.1|col.1|col.2|
+--------------------+-----+-----+
|2015-12-06 12:40:...| 2| 2|
|2015-12-07 12:41:...| 3| 3|
|2015-12-08 12:43:...| 4| 5|
|2015-12-09 12:47:...| 5| 7|
+--------------------+-----+-----+
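A minimal sketch of how the single-pass version could be wrapped into a reusable helper; the name dropConstantColumns is purely illustrative, not part of any API:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Drops every column with at most one distinct value, computing all the
// counts in a single aggregation job. Works for any column type,
// including timestamps.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val aggs = df.columns.map(c => countDistinct(s"`$c`").as(c))
  val counts = df.agg(aggs.head, aggs.tail: _*).first().getValuesMap[Long](df.columns)
  val keep = df.columns.collect { case c if counts(c) > 1 => s"`$c`" }
  df.select(keep.head, keep.tail: _*)
}

val finalResult = dropConstantColumns(df)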