Question

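So I have a huge DataFrame that is a combination of individual tables. It has an identifier column at the end that specifies the table number, as shown below: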

+----------------------------+
| col1 col2 .... table_num   |
+----------------------------+
| x     y            1       |
| a     b            1       |
| .     .            .       |
| .     .            .       |
| q     p            2       |
+----------------------------+

(original table)

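I have to split this into multiple small DataFrames based on the table number. The number of tables that were combined to create it is very large, so creating the disjoint subset DataFrames one by one is not feasible. I was thinking that a for loop iterating from the minimum to the maximum value of table_num could do the job, but I can't seem to get it to work. Any help would be appreciated.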

This is what I came up with:

for (x < min(table_num) to max(table_num)) {

var df(x)= spark.sql("select * from df1 where state = x")
df(x).collect()

But I don't think the declaration is correct.

So essentially what I need is DataFrames that look like this:

+-----------------------------+
| col1  col2  ...   table_num |
+-----------------------------+
| x      y             1      |
| a      b             1      |
+-----------------------------+


+------------------------------+
| col1   col2  ...   table_num |
+------------------------------+
| xx      xy             2     |
| aa      bb             2     |
+------------------------------+

+-------------------------------+
| col1    col2  ...   table_num |
+-------------------------------+
| xxy      yyy             3    |
| aaa      bbb             3    |
+-------------------------------+

...and so on...

(how I would like the Dataframes split)

Answer 1

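In Spark, an array can hold almost any data type. When it is declared as a var, you can reassign it to a new array with elements added or removed. Below, I isolate the table nums into their own array so that I can easily iterate over them. Once they are isolated, I go through a while loop and add each table as its own element of the DF holder array. To query an element of the array, use DFHolderArray(n-1), where n is the position you want to query; the first element is at index 0.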

import org.apache.spark.sql.functions.lit

//This turns the distinct table nums into a queryable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect

//Build the iterator
var iterator = 1  

//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF

//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF) 

//Loop until the end of the array is reached
while(iterator<=tableIDArray.length) {
  //Call the table that is stored in that location of the array
  tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
  //Fluff
  interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
  //If logic to overwrite or append the DF
  DFHolderArray = if (iterator==1) {
    Array(interimDF)
  } else {
    DFHolderArray ++ Array(interimDF)
  }
  iterator = iterator + 1
}

//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
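
The same result can probably be reached without the mutable variables and the placeholder DataFrame by mapping over the collected table numbers. The following is only a rough sketch, not part of the original answer. It assumes the filter column is table_num (the answer's query filters on state, following the question), creates a local SparkSession for illustration, and uses a small made-up inputDF; the names tableIDs and dfHolder are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-by-table-num").getOrCreate()
import spark.implicits._

// Made-up stand-in for the combined DataFrame from the question.
val inputDF = Seq(("x", "y", 1), ("a", "b", 1), ("q", "p", 2)).toDF("col1", "col2", "table_num")

// Distinct table numbers, sorted so that positions in the array are predictable.
val tableIDs: Array[Int] = inputDF.select("table_num").distinct.as[Int].collect().sorted

// One DataFrame per table number, with the same extra column as in the loop above.
val dfHolder: Array[DataFrame] = tableIDs.map { id =>
  inputDF.filter(col("table_num") === id).withColumn("User_Name", lit("Stack_Overflow"))
}

dfHolder(0).show(10, false)
```

Each element is a lazy filter over inputDF, so calling inputDF.cache() first may be worth considering when many of the resulting DataFrames are going to be evaluated.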

Answer 2

The approach is to collect all the unique keys and build the respective DataFrames. I added some functional flavor to it.

Sample dataset:

  name,year,country,id
  Bayern Munich,2014,Germany,7747
  Bayern Munich,2014,Germany,7747
  Bayern Munich,2014,Germany,7746
  Borussia Dortmund,2014,Germany,7746
  Borussia Mönchengladbach,2014,Germany,7746
  Schalke 04,2014,Germany,7746
  Schalke 04,2014,Germany,7753
  Lazio,2014,Germany,7753
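
The approach this answer describes (collect the unique keys, then build one DataFrame per key) can be sketched roughly as below. This is an illustration under assumptions, not the answer's original code: the sample rows are inlined rather than read from a file, a local SparkSession is created for the example, and the names teams, ids and dfById are made up here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("split-by-key").getOrCreate()
import spark.implicits._

// The sample dataset from above, inlined.
val teams = Seq(
  ("Bayern Munich", 2014, "Germany", 7747),
  ("Bayern Munich", 2014, "Germany", 7747),
  ("Bayern Munich", 2014, "Germany", 7746),
  ("Borussia Dortmund", 2014, "Germany", 7746),
  ("Borussia Mönchengladbach", 2014, "Germany", 7746),
  ("Schalke 04", 2014, "Germany", 7746),
  ("Schalke 04", 2014, "Germany", 7753),
  ("Lazio", 2014, "Germany", 7753)
).toDF("name", "year", "country", "id")

// Collect the unique keys to the driver (assumes the number of distinct keys is manageable).
val ids: Array[Int] = teams.select("id").distinct.as[Int].collect()

// Build one filtered DataFrame per key and index the results by that key.
val dfById: Map[Int, DataFrame] = ids.map(id => id -> teams.filter(col("id") === id)).toMap

dfById(7746).show(false)
```

Keying the results by id in a Map, rather than by position in an array, lets each DataFrame be looked up directly by the value it was split on.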

Answer 3

One approach is to write the DataFrame out as Parquet files partitioned by table_num, then read the partitions back into a Map, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("a", "b", 1), ("c", "d", 1), ("e", "f", 1), 
  ("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")

val filePath = "/path/to/parquet/files"

df.write.partitionBy("table_num").parquet(filePath)

val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)

val dfMap = ( for { n <- tableNumList } yield
    (n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
  ).toMap

To access the individual DataFrames from the Map, do the following:

dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// |  a|  b|        1|
// |  c|  d|        1|
// |  e|  f|        1|
// +---+---+---------+

dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// |  g|  h|        2|
// |  i|  j|        2|
// +---+---+---------+

How to create DataFrames in a for loop using the variable being iterated over

3 answers: