Iterate over each column in Spark and find the maximum length

Date: 2019-01-19 01:14:35

Tags: scala apache-spark aggregation maxlength

I am new to Spark Scala and have the following scenario: I have a table "TEST_TABLE" on the cluster (it can be a Hive table), and I am converting it to a dataframe as:

scala> val testDF = spark.sql("select * from TEST_TABLE limit 10")

Now the DF can be viewed as

scala> testDF.show()

COL1|COL2|COL3
----------------
abc|abcd|abcdef
a|BCBDFG|qddfde
MN|1234B678|sd

I want output like the one below, giving the maximum string length of each column:

COLUMN_NAME|MAX_LENGTH
       COL1|3
       COL2|8
       COL3|6

Is it feasible to do this in Spark Scala?

3 Answers:

Answer 0 (score: 4):

Simple and straightforward:

import org.apache.spark.sql.functions._

val df = spark.table("TEST_TABLE")
df.select(df.columns.map(c => max(length(col(c)))): _*)
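This returns a single row containing one max(length(...)) value per input column. If the values are needed keyed by column name, a minimal sketch (assuming the question's sample data and no entirely-null column) is:

// collect the single row of maxima and pair each value with its column name
val maxRow = df.select(df.columns.map(c => max(length(col(c)))): _*).head
val maxLengths = df.columns.zipWithIndex.map { case (name, i) => name -> maxRow.getInt(i) }.toMap
// with the question's sample data this should give Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)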

Answer 1 (score: 1):

You can try it in the following way:


I think it is better to cache the input dataframe to speed up the computation, since the code below runs one aggregation per column:

import org.apache.spark.sql.functions.{length, max}
import spark.implicits._

val df = Seq(("abc","abcd","abcdef"),
             ("a","BCBDFG","qddfde"),
             ("MN","1234B678","sd"),
             (null,"","sd")).toDF("COL1","COL2","COL3")
df.cache()

val output = df.columns
  .map(c => (c, df.agg(max(length(df(s"$c")))).as[Int].first()))
  .toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
output.show()

+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
|       COL1|         3|
|       COL2|         8|
|       COL3|         6|
+-----------+----------+
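The loop above launches one Spark job per column, so on wide tables it may be cheaper to compute all the maxima in a single aggregation pass. A minimal sketch of that variant (not part of the original answer; it assumes the same df as above and that no column is entirely null):

// build one max(length(...)) expression per column and aggregate them in a single pass
val aggExprs = df.columns.map(c => max(length(df(c))))
val maxRow   = df.agg(aggExprs.head, aggExprs.tail: _*).head

// pair each column name with its maximum length and build the same report
val singlePass = df.columns.zipWithIndex
  .map { case (name, i) => (name, maxRow.getInt(i)) }
  .toSeq.toDF("COLUMN_NAME", "MAX_LENGTH")
singlePass.show()

With the sample data this should print the same COLUMN_NAME / MAX_LENGTH table, but with one job instead of one per column.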

Answer 2 (score: 1):

Here is another way to get the report with the column names laid out vertically:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("abc","abcd","abcdef"),("a","BCBDFG","qddfde"),("MN","1234B678","sd")).toDF("COL1","COL2","COL3")
df: org.apache.spark.sql.DataFrame = [COL1: string, COL2: string ... 1 more field]

scala> df.show(false)
+----+--------+------+
|COL1|COL2    |COL3  |
+----+--------+------+
|abc |abcd    |abcdef|
|a   |BCBDFG  |qddfde|
|MN  |1234B678|sd    |
+----+--------+------+

scala> val columns = df.columns
columns: Array[String] = Array(COL1, COL2, COL3)

scala> val df2 = columns.foldLeft(df) { (acc,x) => acc.withColumn(x,length(col(x))) }
df2: org.apache.spark.sql.DataFrame = [COL1: int, COL2: int ... 1 more field]

scala> val df3 = df2.select( columns.map(x => max(col(x))):_* )
df3: org.apache.spark.sql.DataFrame = [max(COL1): int, max(COL2): int ... 1 more field]

scala> df3.show(false)
+---------+---------+---------+
|max(COL1)|max(COL2)|max(COL3)|
+---------+---------+---------+
|3        |8        |6        |
+---------+---------+---------+


scala> df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).show(false)
+----+---+
|_1  |_2 |
+----+---+
|COL1|3  |
|COL2|8  |
|COL3|6  |
+----+---+


scala>
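The _1/_2 headers come from the tuple encoding; if the labelled report asked for in the question is wanted, the same result can be renamed with toDF (a small sketch reusing the df3 and columns defined above), which should print something like:

scala> df3.flatMap( r => (0 until r.length).map( i => (columns(i),r.getInt(i)) ) ).toDF("COLUMN_NAME","MAX_LENGTH").show(false)
+-----------+----------+
|COLUMN_NAME|MAX_LENGTH|
+-----------+----------+
|COL1       |3         |
|COL2       |8         |
|COL3       |6         |
+-----------+----------+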

To get the result into a Scala collection, say a Map():

scala> val result = df3.flatMap( r => { (0 until r.length).map( i => (columns(i),r.getInt(i)) ) } ).as[(String,Int)].collect.toMap
result: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)

scala> result
res47: scala.collection.immutable.Map[String,Int] = Map(COL1 -> 3, COL2 -> 8, COL3 -> 6)

scala>
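The Map can then be used directly in driver-side code, for example to pick out the columns whose maximum length exceeds some threshold (the threshold 5 here is purely illustrative):

scala> val wideColumns = result.filter { case (_, len) => len > 5 }.keys.toList
wideColumns: List[String] = List(COL2, COL3)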