Summary statistics for array-type columns in a Spark DataFrame

Date: 2017-02-25 08:00:56

Tags: scala apache-spark spark-dataframe

def toDouble(s: String) = {
  if ("?".equals(s)) Double.NaN else s.toDouble
}

def parse(line: String) = {
  val pieces = line.split(',')
  val id1 = pieces(0).toInt
  val scores = pieces.slice(2, 11).map(toDouble)
  val matched = pieces(11).toBoolean
  MatchData(id1, scores, matched)
}
case class MatchData(id1: Int, scores: Array[Double], matched: Boolean)

val inputrdd = spark.sparkContext.textFile("../donation/block_*.csv")
val noheader = inputrdd.filter(x => !x.contains("id_1"))
val df = noheader.map(line => parse(line)).toDF()
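As an aside, the parsing logic above can be sanity-checked without Spark. The sample line below is hypothetical, but follows the block_*.csv layout implied by the code (two id fields, nine score fields where "?" means missing, and a boolean match flag):

```scala
case class MatchData(id1: Int, scores: Array[Double], matched: Boolean)

def toDouble(s: String): Double =
  if ("?".equals(s)) Double.NaN else s.toDouble

def parse(line: String): MatchData = {
  val pieces = line.split(',')
  MatchData(pieces(0).toInt, pieces.slice(2, 11).map(toDouble), pieces(11).toBoolean)
}

// Made-up line in the same layout as the donation dataset
val sample = "53113,95362,0.833333333333333,?,1,?,1,1,1,1,0,true"
val m = parse(sample)
assert(m.id1 == 53113)
assert(m.scores.length == 9)  // slice(2, 11) keeps indices 2 through 10
assert(m.scores(1).isNaN)     // "?" becomes NaN
assert(m.matched)
```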

The resulting DataFrame has the following schema:

root
 |-- id1: integer (nullable = true)
 |-- scores: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- matched: boolean (nullable = true)

The first three records look like this:

[53113,WrappedArray(0.833333333333333, NaN, 1.0, NaN, 1.0, 1.0, 1.0, 1.0, 0.0),true]
[47614,WrappedArray(1.0, NaN, 1.0, NaN, 1.0, 1.0, 1.0, 1.0, 1.0),true]
[70237,WrappedArray(1.0, NaN, 1.0, NaN, 1.0, 1.0, 1.0, 1.0, 1.0),true]

I want to get summary statistics such as count, mean, max, and min for each element position of the WrappedArray column.

My idea was to build another DataFrame just for the WrappedArray elements, filtering out the NaN values and aliasing each column with the index of the array element, then use select on that DataFrame with functions like count(), min(), and max(). But it did not produce any results.

val dfnona = (0 until 9).map(i => {
  df.select("scores").as[Seq[Double]].filter(s => s(i) != Double.NaN).alias(i.toString())
})

dfnona.select(count("0"), mean("0"), stddev_pop("0"), max("0"), min("0")).show()
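(One likely reason the filter in the snippet above removes nothing: in Scala, as in IEEE 754 floating point generally, NaN compares unequal to everything, including itself, so `s(i) != Double.NaN` is true for every value and the NaN entries slip through. A quick pure-Scala illustration:)

```scala
val xs = Seq(0.5, Double.NaN, 1.0)

// NaN != NaN, so this predicate is true for every element and filters nothing out
val broken = xs.filter(x => x != Double.NaN)
assert(broken.size == 3)

// isNaN is the reliable test for missing values encoded as NaN
val fixed = xs.filterNot(_.isNaN)
assert(fixed == Seq(0.5, 1.0))
```

(On the DataFrame side, the analogous check is the `isnan` function from `org.apache.spark.sql.functions`.)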

Could someone give me some pointers on how to achieve this?

2 Answers:

Answer 0 (score: 1)

If the scores array has a fixed length across the whole Dataset, you can use this solution.

val df = ... // create the DataFrame with the schema you mentioned

// Here $"scores"(0) fetches the first element of the scores array
val subjects = df.withColumn("sub1", $"scores"(0))
  .withColumn("sub2", $"scores"(1))
  .withColumn("sub3", $"scores"(2))
  .withColumn("sub4", $"scores"(3))
  .select("sub1", "sub2", "sub3", "sub4")

// Alternate approach
val numberOfSubjects = 4
val subjects = (0 until numberOfSubjects).foldLeft(df)((accDf, index) => {
  accDf.withColumn(s"sub${index}", $"scores" (index))
})


subjects.printSchema()

root
 |-- sub1: double (nullable = true)
 |-- sub2: double (nullable = true)
 |-- sub3: double (nullable = true)
 |-- sub4: double (nullable = true)

Now you can apply all the statistical functions to the sub1, sub2, sub3, sub4 columns.
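For instance (a sketch, assuming the `subjects` DataFrame built above and a live SparkSession; note that Spark's aggregates skip nulls but not NaN, so the NaN values standing in for "?" must be filtered out with `isnan` first):

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Stats for one column, excluding the NaN entries that encode missing values
subjects
  .where(!isnan($"sub1"))
  .select(count("sub1"), mean("sub1"), stddev_pop("sub1"), max("sub1"), min("sub1"))
  .show()

// describe() computes count/mean/stddev/min/max for the named columns in one pass
subjects.describe("sub1", "sub2", "sub3", "sub4").show()
```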

Answer 1 (score: 0)

This is the code I came up with.

This post helped, and so did mrsrinivas's answer.
