I have a list of tables, say x, y, z, and each table has a few columns: e.g. test, test1, test2, test3 for table x, columns rem, rem1, rem2 for table y, and the same pattern for table z. The requirement is to loop over every such column of every table and derive a row count based on the following scenario: a row should be flagged as 1 when the key column (test, rem, ...) is non-null but all of its numbered companions (test1, test2, ... / rem1, rem2, ...) are null, and 0 otherwise.
So we have to loop through each table, find the columns matching test* (and so on), apply the condition above, and mark the row as 1 when it is satisfied.
I am very new to Scala, but I came up with the following approach:
for each $tablename {
  val df = sql(s"select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)) : _*)
  val df_filtered = coldf.map(eachrow => df.filter($"eachrow".isNull))
}
It is not working for me, and I don't know where to keep the count variable. I would really appreciate it if anyone could help. I am using Spark 2 and Scala.
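(For reference: the condition for a single table can also be expressed directly with the DataFrame API. This is a minimal sketch, assuming a DataFrame df with the columns test, test1, test2, test3 described above; it is not the approach taken in the answer below:

import org.apache.spark.sql.functions.{coalesce, col, when}

// Flag a row with 1 when "test" is non-null but test1..test3 are all null;
// coalesce(...) is null only when every argument is null
val flagged = df.withColumn(
  "count",
  when(col("test").isNotNull &&
       coalesce(col("test1"), col("test2"), col("test3")).isNull, 1).otherwise(0)
)
)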
Below is the code that builds the list of tables and the table-to-key-column mapping:
val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")
//List of all table names in the db, and the corresponding DataFrames
val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)
//Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
After this I use the following method:
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft(0)(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfTableList.map{ df =>
  val keyCol = dfColMap(df)
  //println(keyCol)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It gives me the following error:
createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
at scala.collection.immutable.List.map(List.scala:273)
... 78 elided
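(For what it's worth, the "key not found" error above is almost certainly a consequence of how dfColMap is keyed: spark.table(t) returns a new DataFrame instance on every call, and Map lookups on DataFrames fall back to reference equality, so the instances in dfTableList, built by a second round of spark.table calls, are never the instances stored as map keys. A minimal fix under that assumption is to derive the iteration list from the map itself:

// Reuse the Map's own keys so that dfColMap(df) finds exactly the
// DataFrame instances being iterated over
val dfTableList = dfColMap.keys.toList

With that change, the dfCountList loop above can resolve each table's key column without the lookup failure.)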
Answer 0 (score: 0)
Based on your requirements, I would suggest you leverage RDD functionality and use a Row-based method to create the count for every row of each DataFrame:
val dfX = Seq(
  ("a", "ma", "a1", "a2", "a3"),
  ("b", "mb", null, null, null),
  ("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")

val dfY = Seq(
  ("nd", "d", "d1", null),
  ("ne", "e", "e1", "e2"),
  ("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")

val dfZ = Seq(
  ("g", null, "g1", "g2", "qg"),
  ("h", "ph", null, null, null),
  ("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType

val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")
// Returns 1 when the key column is non-null but every check column is null,
// and 0 otherwise
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft(0)(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
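For example, applied to the second row of dfX above (key xx = "b" is non-null, xx1 through xx3 are all null), the function returns 1:

val row = dfX.head(2).last
// row: org.apache.spark.sql.Row = [b,mb,null,null,null]
createCount(row, "xx", Seq("xx1", "xx2", "xx3"))
// res: Int = 1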
val dfCountList = dfList.map{ df =>
  val keyCol = dfColMap(df)
  // Check columns are those named like the key column followed by digits
  // (e.g. xx1, xx2, xx3 for key column xx)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  // Append the computed flag to every row and rebuild the DataFrame with
  // an extra "count" column
  val rddWithCount = df.rdd.map{ case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
dfCountList(0).show
// +----+---+----+----+----+-----+
// | xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// | a| ma| a1| a2| a3| 0|
// | b| mb|null|null|null| 1|
// |null| mc| c1|null| c3| 0|
// +----+---+----+----+----+-----+
dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd| d| d1|null| 0|
// | ne| e| e1| e2| 0|
// | nf| f|null|null| 1|
// +---+---+----+----+-----+
dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz| pp| zz1| zz2| qq|count|
// +---+----+----+----+----+-----+
// | g|null| g1| g2| qg| 0|
// | h| ph|null|null|null| 1|
// | i| pi|null|null| qi| 1|
// +---+----+----+----+----+-----+
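If what you ultimately need is a single figure per table rather than a per-row flag, you can simply aggregate the new column; for instance, a short follow-up sketch on the DataFrames above:

// Sum the flag column to get the number of qualifying rows per table
val totals = dfCountList.map(_.agg(sum("count")).first.getLong(0))
// totals: List[Long] = List(1, 1, 2)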
[UPDATE]
Note that the above solution works for an arbitrary number of DataFrames, as long as you have them in dfList together with their corresponding key columns in dfColMap.
If you have a list of Hive tables instead, simply convert them into DataFrames using spark.table(), like this:
val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)
Now you still have to tell Spark what the key column of each table is. Let's assume your list is in the same order as the list of tables. You can zip the two lists to create dfColMap, and you'll have everything needed to apply the above solution:
val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
[UPDATE #2]
If your Hive table names and their corresponding key column names are stored in a DataFrame, you can generate dfColMap as follows:
val dfTabColPair = Seq(
  ("tableX", "xx"),
  ("tableY", "yy"),
  ("tableZ", "zz")
).toDF("tabname", "colname")

val tabColPairList = dfTabColPair.rdd.map( r => (r(0).toString, r(1).toString)).collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))

val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)