How to loop through multiple column values in a DataFrame to get counts

Time: 2018-03-28 15:59:48

标签: scala apache-spark apache-spark-sql spark-dataframe rdd

I have a list of tables, say x, y, z, and each table has a group of columns: for example test, test1, test2, test3 for table x; columns like rem, rem1, rem2 for table y; and the same kind of thing for table z. The requirement is to loop through every column in each table and get a row count based on the scenario below.

  • If test is not NULL and all the others (test1, test2, test3) are NULL, then that row counts as 1.

Now we have to loop through each table, find the columns that look like test*, check them against the condition above, and mark the row with a count of 1 if it satisfies that condition.
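For a single table, the condition I am trying to express would look something like the sketch below (the table and column names here are only examples, not my real schema):

import org.apache.spark.sql.functions.col

// Hypothetical single-table version of the requirement: count rows where
// `test` is set but test1, test2 and test3 are all NULL.
val df = sql("select * from x")

val matched = df.filter(
  col("test").isNotNull &&
  col("test1").isNull &&
  col("test2").isNull &&
  col("test3").isNull
)

val testCount = matched.count()   // the per-table count I am after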

I am very new to Scala, but I thought of the following approach.

// pseudocode -- this does not compile as written
for each $tablename {
  val df = sql("select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)) : _*)
  val df_filtered = coldf.map(eachrow -> df.filter(s"$eachrow".isNull))
}

It is not working for me, and I also do not know where to place the count variable. I would be very grateful if anyone can help. I am using Spark 2 and Scala.

Code update

Below is the code that generates the list of tables and the mapping of each table to its key column.

val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")

//List of all tables in the db as list of df

val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)



 //Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap

After this I use the following approach:

def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {

  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}

val dfCountList = dfTableList.map{ df =>
  val keyCol = dfColMap(df)
  //println(keyCol)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten

  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It gives me the following error:

createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at scala.collection.AbstractMap.apply(Map.scala:59)
  at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
  at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
  at scala.collection.immutable.List.map(List.scala:273)
  ... 78 elided

1 Answer:

Answer 0 (score: 0)

Per your requirement, I would suggest tapping into the RDD functionality and using a Row-based method to create the count for each row of each of your DataFrames:

val dfX = Seq(
  ("a", "ma", "a1", "a2", "a3"),
  ("b", "mb", null, null, null),
  ("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")

val dfY = Seq(
  ("nd", "d", "d1", null),
  ("ne", "e", "e1", "e2"),
  ("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")

val dfZ = Seq(
  ("g", null, "g1", "g2", "qg"),
  ("h", "ph", null, null, null),
  ("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType

val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")

def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  // If the key column (e.g. `xx`) is NULL, the row never counts
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    // Number of the related columns (e.g. `xx1`, `xx2`, `xx3`) that are non-NULL
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft( 0 )(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    // Count the row only when the key column is set and all related columns are NULL
    if (nonNulls == 0) 1 else 0
  }
}

val dfCountList = dfList.map{ df =>
  val keyCol = dfColMap(df)
  // Columns whose names are the key column followed by digits, e.g. xx1, xx2, xx3
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten

  // Append the computed count to each row and rebuild the DataFrame with the extra column
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}

dfCountList(0).show
// +----+---+----+----+----+-----+
// |  xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// |   a| ma|  a1|  a2|  a3|    0|
// |   b| mb|null|null|null|    1|
// |null| mc|  c1|null|  c3|    0|
// +----+---+----+----+----+-----+

dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd|  d|  d1|null|    0|
// | ne|  e|  e1|  e2|    0|
// | nf|  f|null|null|    1|
// +---+---+----+----+-----+

dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz|  pp| zz1| zz2|  qq|count|
// +---+----+----+----+----+-----+
// |  g|null|  g1|  g2|  qg|    0|
// |  h|  ph|null|null|null|    1|
// |  i|  pi|null|null|  qi|    1|
// +---+----+----+----+----+-----+
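If what you ultimately need is a single number per table rather than a per-row flag, one straightforward follow-up (a sketch, not part of the requirement as stated) is to sum the count column of each resulting DataFrame:

import org.apache.spark.sql.functions.sum  // already covered by the functions._ import above

// Total count per table: the sum of the per-row `count` flags
val totals = dfCountList.map( df => df.agg(sum("count")).first.getLong(0) )
// totals: List[Long] = List(1, 1, 2)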

[UPDATE]

Note that the above solution works for any number of DataFrames, as long as you have them in dfList and their corresponding key columns in dfColMap.

If what you have is a list of Hive tables, simply convert them into DataFrames using spark.table(), like in the following:

val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)

Now you still have to tell Spark what the key column of each table is. Let's assume your list of key columns is in the same order as the list of tables. You can zip the two lists to create dfColMap, and you will have everything needed to apply the above solution:

val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)

[UPDATE #2]

If your Hive table names and their corresponding key column names are stored in a DataFrame, you can generate dfColMap as follows:

val dfTabColPair = Seq(
  ("tableX", "xx"),
  ("tableY", "yy"),
  ("tableZ", "zz")
).toDF("tabname", "colname")

val tabColPairList = dfTabColPair.rdd.map( r => (r(0).toString, r(1).toString)).
  collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))

val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
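Note that dfColMap is keyed by the DataFrame objects themselves, so any dfList you use alongside it should contain those exact same instances (calling spark.table() a second time for the same table produces a different object, and the map lookup will then fail). A minimal sketch that keeps both consistent, assuming the same table names as above:

val dfTabPairs = tabColPairList.map{ case (t, c) => (spark.table(t), c) }

// Build dfList and dfColMap from the same intermediate collection so that
// dfColMap(df) inside the loop always finds its key.
val dfList   = dfTabPairs.map(_._1).toList
val dfColMap = dfTabPairs.toMap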