Question

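I'm new to Spark and am trying to figure out a better way to implement the following scenario. There is a database table with 3 fields: Category, Amount, Quantity. First, I try to pull all the distinct categories from the database.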

 val categories:RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)

Now, for each category, I want to execute a pipeline, which essentially creates a DataFrame for each category and applies some machine learning.

 categories.foreach(executePipeline)
 def execute(category: String): Unit = {
   val dfCategory = sqlCtxt.read.jdbc(JDBC_URL,"SELECT * FROM TABLE_NAME WHERE CATEGORY="+category)
   dfCategory.show()
 }

Is it possible to do something like this? Or is there a better alternative?

Answer 1

// You could get all your data with a single read and convert it to an RDD
// (read.jdbc takes a table name plus connection properties)
val data = sqlCtxt.read.jdbc(JDBC_URL, "TABLE_NAME", new java.util.Properties()).rdd

// then group the data by category
val groupedData = data.groupBy(row => row.getAs[String]("category"))

// then you get an RDD[(String, Iterable[org.apache.spark.sql.Row])]
// and you can iterate over it and execute your pipeline
// (foreach is an action, so it actually runs; a bare map would stay lazy)
groupedData.foreach { case (categoryName, items) =>
  //executePipeline(categoryName, items)
}
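
One thing to keep in mind with this approach: inside the closure, `items` is a plain `Iterable[Row]` on an executor, not a DataFrame, so the per-category work has to be written against ordinary Scala collections. Below is a minimal sketch of that idea. It assumes a local `SparkSession` (Spark 2.x or later, rather than the `sqlCtxt` used above), made-up in-memory rows in place of the JDBC table, and a hypothetical `summarize` helper standing in for the real pipeline.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object GroupedByCategory {
  // Hypothetical stand-in for the pipeline: total Amount for one category.
  // It only sees plain Rows, so no DataFrame or SQLContext is needed here.
  def summarize(items: Iterable[Row]): Double =
    items.map(_.getAs[Double]("Amount")).sum

  def totals(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    // Made-up sample data standing in for the JDBC table (an assumption).
    val df = Seq(("a", 1.0, 2), ("b", 3.0, 4), ("a", 5.0, 6))
      .toDF("Category", "Amount", "Quantity")

    df.rdd
      .groupBy(row => row.getAs[String]("Category")) // RDD[(String, Iterable[Row])]
      .mapValues(summarize)                          // runs on the executors
      .collect()                                     // small result, back to the driver
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("grouped").getOrCreate()
    try totals(spark).foreach { case (category, total) => println(s"$category -> $total") }
    finally spark.stop()
  }
}
```

Note that `groupBy` shuffles every row of a category to a single task, so this is probably only reasonable when each category fits comfortably in one executor's memory.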

Answer 2

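Your code would fail with a Task not serializable exception, because you're trying to use the SQLContext (which isn't serializable) inside the execute method, and that method has to be serialized and sent to the workers so it can run on each record of the categories RDD.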

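Assuming you know the number of categories is limited, meaning the list of categories is small enough to fit in your driver's memory, you should collect the categories to the driver and iterate over that local collection using foreach: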

val categoriesRdd: RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
val categories: Seq[String] = categoriesRdd.collect()
categories.foreach(executePipeline)

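Another improvement would be to reuse the DataFrame you already loaded instead of running another query, applying a filter for each category: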

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def executePipeline(singleCategoryDf: DataFrame): Unit = { /* ... */ }

categories.foreach(cat => {
  val filtered = df.filter(col(CATEGORY) === cat)
  executePipeline(filtered)
})

Note: to make sure that reusing df doesn't reload it for every execution, make sure you cache() it before collecting the categories.
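
Putting the pieces of this answer together (cache, collect to the driver, then filter per category) might look roughly like the sketch below. It assumes a local `SparkSession` (Spark 2.x or later, rather than the `sqlCtxt` from the question), made-up in-memory rows instead of the JDBC table, and a hypothetical `executePipeline` that merely counts rows in place of the real machine-learning step.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PerCategoryPipeline {
  // Hypothetical stand-in for the real pipeline: just counts the rows.
  def executePipeline(singleCategoryDf: DataFrame): Long = singleCategoryDf.count()

  def run(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    // Made-up sample data; in the question this would come from read.jdbc.
    val df = Seq(("a", 1.0, 2), ("b", 3.0, 4), ("a", 5.0, 6))
      .toDF("Category", "Amount", "Quantity")
      .cache() // mark for caching before the first action

    // First action on df: brings the distinct categories to the driver.
    val categories: Array[String] =
      df.select("Category").distinct().as[String].collect()

    // This loop runs on the driver, so using DataFrames inside it is fine.
    val result = categories.map { cat =>
      cat -> executePipeline(df.filter(col("Category") === cat))
    }.toMap

    df.unpersist() // release the cached data once every category is done
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("per-category").getOrCreate()
    try run(spark).foreach { case (cat, n) => println(s"$cat: $n rows") }
    finally spark.stop()
  }
}
```

Because the iteration happens over a local array, each `executePipeline` call submits its own Spark jobs one after another; the cached `df` keeps those jobs from going back to the database each time.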

SPARK - Use RDD.foreach to create a DataFrame and execute actions on the DataFrame
