Drop constant columns with non-numeric data

Asked: 2017-04-21 13:08:21

Tags: scala apache-spark apache-spark-sql spark-dataframe

I want to drop the constant columns of a CSV file. I wrote the source code below, but when a column's type is Timestamp (i.e. not numeric), it throws an exception. Can this code be generalized so it works for inputs of all types? Thanks.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.stddev

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true).csv(strPath)
df.printSchema()

// compute the standard deviation of every column; this is what throws
// when a column is not numeric (e.g. Timestamp)
val aggregations = df.columns.map(c => stddev(s"`${c}`").as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
df2.printSchema()

// keep only the columns whose standard deviation is non-zero
val columnsToKeep: Seq[String] = (df2.first match {
  case r: Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
}).zip(df.columns)
  .filter(_._1 != 0) // your special condition is in the filter
  .map(_._2)         // keep just the name of the column

val columns = columnsToKeep.map(f => s"`${f}`")
val finalResult = df.select(columns.head, columns.tail: _*)
finalResult.printSchema()

Here is a data sample; the timestamp column, for example, is what causes the problem:

time.1,col.1,col.2,col.3
2015-12-06 12:40:00,2,2,3
2015-12-07 12:41:35,3,3,3
2015-12-08 12:43:22,4,5,3
2015-12-09 12:47:55,5,7,3

The expected result is:

time.1,col.1,col.2
2015-12-06 12:40:00,2,2
2015-12-07 12:41:35,3,3
2015-12-08 12:43:22,4,5
2015-12-09 12:47:55,5,7

because col.3 never changes.
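As an aside (not from the original post): one partial workaround is to apply stddev only to the numeric fields reported by df.schema, which avoids the exception but silently keeps constant non-numeric columns; the hypothetical numericCols below sketches the idea:

import org.apache.spark.sql.types.NumericType

// hypothetical workaround: restrict the stddev aggregation to numeric fields,
// so Timestamp and String columns never reach stddev (but are never dropped either)
val numericCols = df.schema.fields.collect { case f if f.dataType.isInstanceOf[NumericType] => f.name }
val aggregations = numericCols.map(c => stddev(s"`${c}`").as(c))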

1 answer:

Answer 0 (score: 1)

To drop the constant columns, here is how you can use countDistinct:

val to_keep = df.columns.collect{ case col if df.agg(countDistinct(s"`${col}`")).first().getAs[Long](0) > 1 => s"`${col}`" }
// to_keep: Array[String] = Array(`time.1`, `col.1`, `col.2`)


df.select(to_keep.head, to_keep.tail: _*).show
+--------------------+-----+-----+
|              time.1|col.1|col.2|
+--------------------+-----+-----+
|2015-12-06 12:40:...|    2|    2|
|2015-12-07 12:41:...|    3|    3|
|2015-12-08 12:43:...|    4|    5|
|2015-12-09 12:47:...|    5|    7|
+--------------------+-----+-----+
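Note that the one-liner above launches a separate Spark job per column, because every df.agg(...).first() call is its own action. With many columns it is cheaper to compute all the distinct counts in a single pass over the data.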

First build the aggregations:

val aggs = df.columns.map(col => countDistinct(s"`${col}`").as(col))
// aggs: Array[org.apache.spark.sql.Column] = Array(count(DISTINCT `time.1`) AS `time.1`, count(DISTINCT `col.1`) AS `col.1`, count(DISTINCT `col.2`) AS `col.2`, count(DISTINCT `col.3`) AS `col.3`)

val count_map = df.agg(aggs.head, aggs.tail: _*).first().getValuesMap[Long](df.columns)
// count_map: Map[String,Long] = Map(time.1 -> 4, col.1 -> 4, col.2 -> 4, col.3 -> 1)

val to_keep = df.columns.collect{case col if count_map(col) > 1 => s"`${col}`"}
// to_keep: Array[String] = Array(`time.1`, `col.1`, `col.2`)    

df.select(to_keep.head, to_keep.tail: _*).show
+--------------------+-----+-----+
|              time.1|col.1|col.2|
+--------------------+-----+-----+
|2015-12-06 12:40:...|    2|    2|
|2015-12-07 12:41:...|    3|    3|
|2015-12-08 12:43:...|    4|    5|
|2015-12-09 12:47:...|    5|    7|
+--------------------+-----+-----+
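For completeness, here is a minimal sketch that plugs this back into the question's CSV pipeline (strPath and the SparkSession settings are taken from the question; the column filtering follows the single-pass variant above and works for any column type, Timestamp included):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true).csv(strPath)

// a single job over the data: distinct counts of every column at once
val aggs = df.columns.map(c => countDistinct(s"`${c}`").as(c))
val countMap = df.agg(aggs.head, aggs.tail: _*).first().getValuesMap[Long](df.columns)

// keep the columns with more than one distinct value (back-ticks protect the dots in the names)
val toKeep = df.columns.collect { case c if countMap(c) > 1 => s"`${c}`" }
val finalResult = df.select(toKeep.head, toKeep.tail: _*)
finalResult.printSchema()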