Using Scala, how can I split a DataFrame into multiple DataFrames (as an array or a collection) that share the same column value? For example, I want to split the following DataFrame:
ID Rate State
1 24 AL
2 35 MN
3 46 FL
4 34 AL
5 78 MN
6 99 FL
into:
Dataset 1
ID Rate State
1 24 AL
4 34 AL
Dataset 2
ID Rate State
2 35 MN
5 78 MN
Dataset 3
ID Rate State
3 46 FL
6 99 FL
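For reference, a minimal sketch of how this sample DataFrame could be built (an illustration only; it assumes a SparkSession named spark with its implicits imported):
import spark.implicits._
// Reconstruct the sample data shown above
val df = Seq(
  (1, 24, "AL"), (2, 35, "MN"), (3, 46, "FL"),
  (4, 34, "AL"), (5, 78, "MN"), (6, 99, "FL")
).toDF("ID", "Rate", "State")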
Answer 0 (score: 15)
You can collect the unique State values and simply map over the resulting array:
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))
or build a map:
val byStateMap = states
.map(state => (state -> df.where($"State" <=> state)))
.toMap
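A small usage sketch for the map version (the "AL" key is just one of the example states from the question):
// Look up the split for a single state and display it
byStateMap.get("AL").foreach(_.show())
// Or count the rows in every split
byStateMap.foreach { case (state, stateDF) => println(s"$state: ${stateDF.count()}") }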
The same thing in Python:
from itertools import chain
from pyspark.sql.functions import col
states = chain(*df.select("state").distinct().collect())
# PySpark 2.3 and later: eqNullSafe handles NULLs directly.
# In 2.2 and earlier, col("state") == state should give the same outcome,
# ignoring NULLs. If NULLs are important, use:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state:
df.where(col("state").eqNullSafe(state)) for state in states}
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you are only looking for a way to split the output, see How do I split an RDD into two or more RDDs?
In particular, you can write a Dataset partitioned by the column of interest:
val path: String = ???
df.write.partitionBy("State").parquet(path)
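With partitionBy the output is laid out as one directory per value (path/State=AL, path/State=MN, path/State=FL), which is what the explicit per-partition read below relies on.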
and read it back when needed:
// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)
// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")
Depending on the size of the data, the number of splits, the storage, and the persistence level, this may be faster or slower than multiple filters.
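If you stay with the multiple-filter approach, caching the source DataFrame first can help amortize the repeated scans. A sketch (not part of the original answer; it reuses the states array from above and assumes spark.implicits._ is in scope):
// Cache the source once so each per-state filter reads from memory
df.cache()
df.count()  // materialize the cache before building the per-state filters
val byStateCached = states.map(state => df.where($"State" <=> state))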
Answer 1 (score: 1)
It is very simple if you register the DataFrame as a temporary table (with Spark version 2).
df1.createOrReplaceTempView("df1")
Now you can run your queries:
var df2 = spark.sql("select * from df1 where state = 'FL'")
var df3 = spark.sql("select * from df1 where state = 'MN'")
var df4 = spark.sql("select * from df1 where state = 'AL'")
Now you have df2, df3, and df4. If you want them as lists, you can use
df2.collect()
df3.collect()
or even the map / filter functions. See https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
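Note that collect() pulls every row of the result to the driver as an Array[Row], so it is only appropriate for small results. For reference, a minimal sketch of the same filters written with the DataFrame API instead of SQL strings (using the df1 DataFrame registered above):
import org.apache.spark.sql.functions.col
// Equivalent filters via the DataFrame API
val flDF = df1.filter(col("state") === "FL")
val mnDF = df1.filter(col("state") === "MN")
val alDF = df1.filter(col("state") === "AL")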
Answer 2 (score: -1)
You can use:
// assumes df has been registered as a temp table named "table1",
// e.g. df.createOrReplaceTempView("table1")
val stateDF = df.select("state").distinct()             // distinct states as a DataFrame
val states = stateDF.rdd.map(x => x(0)).collect.toList  // distinct states as a List
val dfsByState = for (state <- states)                  // one DataFrame per state
  yield sqlContext.sql("select * from table1 where state = '" + state + "'")