How to remove empty DataFrames from a Seq of DataFrames in Scala

Date: 2016-11-29 10:27:11

Tags: scala apache-spark dataframe apache-spark-sql

How do I remove empty DataFrames from a sequence of DataFrames? In the snippet below, twoColDF contains many empty DataFrames. A second issue with the for loop below: is there a way to make it more efficient? I tried rewriting it as the following line, but could not get it to work.

//finalDF2 = (1 until colCount).flatMap(j => groupCount(j).map( y=> finalDF.map(a=>a.filter(df(cols(j)) === y)))).toSeq.flatten

var twoColDF: Seq[Seq[DataFrame]] = null
if (colCount == 2) {
  val i = 0
  for (j <- i + 1 until colCount) {
    twoColDF = groupCount(j).map(y => {
      finalDF.map(x => x.filter(df(cols(j)) === y))
    })
  }
}
finalDF = twoColDF.flatten

1 Answer:

Answer 0 (score: 1)

Given a Seq of DataFrames, you can access each DataFrame's underlying RDD and use isEmpty to filter out the empty ones:

val input: Seq[DataFrame] = ???
val result = input.filter(!_.rdd.isEmpty())
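The same filtering pattern can be sanity-checked without a Spark session. Here is a minimal sketch using plain Scala collections as stand-ins for DataFrames (the `Seq[Seq[Int]]` stand-in and its contents are made up for illustration):

```scala
// Stand-in for Seq[DataFrame]: each inner Seq plays the role of a DataFrame's rows.
val input: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq.empty, Seq(3), Seq.empty)

// Same idea as input.filter(!_.rdd.isEmpty()): keep only the non-empty elements.
val result: Seq[Seq[Int]] = input.filter(_.nonEmpty)
// result == Seq(Seq(1, 2), Seq(3))
```

With real DataFrames, `!_.rdd.isEmpty()` plays the role of `_.nonEmpty`; note that it triggers a (small) Spark job per DataFrame.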

As for your other question - I can't quite tell what your code is trying to do, but I would first try converting it into something more functional (removing the use of var and the imperative conditionals). If I'm guessing the meaning of your inputs correctly, here is something that might do the same thing you were attempting:

import org.apache.spark.sql.functions.col

val input: Seq[DataFrame] = ???

// map of column index to column values -
// for each combination we'd want a new DF where that column has that value
// I'm assuming values are Strings, can be anything else
val groupCount: Map[Int, Seq[String]] = ???

// for each combination of DF + column + value - produce the filtered DF where this column has this value
val perValue: Seq[DataFrame] = for {
  df <- input
  index <- groupCount.keySet
  value <- groupCount(index)
} yield df.filter(col(df.columns(index)) === value)

// remove empty results:
val result: Seq[DataFrame] = perValue.filter(!_.rdd.isEmpty())
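To see the shape of that for-comprehension without Spark, here is a sketch where each "DataFrame" is modeled as a Seq of Vector[String] rows (the data and the column values in groupCount are made up for illustration):

```scala
// Each "DataFrame" is a Seq of rows; a row is a Vector of column values.
val input: Seq[Seq[Vector[String]]] = Seq(
  Seq(Vector("a", "x"), Vector("b", "y")),
  Seq(Vector("a", "y"))
)

// Column index -> values of interest; "z" matches nothing, producing empty results.
val groupCount: Map[Int, Seq[String]] = Map(0 -> Seq("a", "z"))

// One filtered "DataFrame" per (df, column, value) combination.
val perValue: Seq[Seq[Vector[String]]] = for {
  df    <- input
  index <- groupCount.keySet.toSeq
  value <- groupCount(index)
} yield df.filter(row => row(index) == value)

// Drop the empty results, mirroring the rdd.isEmpty check used with real DataFrames.
val result: Seq[Seq[Vector[String]]] = perValue.filter(_.nonEmpty)
```

Here `perValue` has 2 DataFrames × 1 column × 2 values = 4 entries, two of which are empty and get dropped, leaving only the rows whose first column equals "a".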