Question

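I wrote this Scala code to do some processing for each row of a Spark DataFrame. These are the steps I follow:

1. Convert the DataFrame into an array.
2. Iterate through the array, perform the calculations, and collect the output in an array.
3. Convert the output array to a DataFrame and then write it to a Hive table.

Step 2 is where I run into problems when I process a million records. Is there any way I can improve the performance? FYI, I only convert the DataFrame to an array because, AFAIK, Spark DataFrames cannot be iterated over.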

import spark.implicits._  // needed for .toDF below

// CA (a case class) and getRecursiveList are defined elsewhere in my code.
// Collects every row to the driver and builds one CA record per row.
def getRows(ca: org.apache.spark.sql.DataFrame) =
{
  for (a <- ca.collect()) yield
  {
    val newAddress = a.getString(1)
    val output = newAddress :: getRecursiveList(newAddress).reverse

    CA(a.getInt(0),
       a.getString(1),
       a.getString(2),
       output.toString)
  }
}

val myArray = getRows(customerAccounts)

// Ship the driver-side array back to the cluster as a DataFrame.
val OutputDataFrame = sc.parallelize(myArray).toDF

OutputDataFrame.show()

// Register a temporary view so the result can be queried with SQL.
OutputDataFrame.createOrReplaceTempView("history")

spark.sql(""" insert into user_tech.history select * from history """)

Answer 1

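First, some basics:

The Spark Scala/Java API gives you a very high-level view and does not expose the distributed nature of the underlying data structures.
You have two options for iterating over a DataFrame: iterate over it in a distributed fashion, or collect all the data on one machine and iterate there.
ca.collect() pulls the DataFrame's data from all the nodes and hands it to the driver for processing, which is not a scalable solution.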
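
As a rough sketch of the distributed option, the per-row work can be moved into a map over a typed Dataset, so it runs on the executors instead of the driver. Everything named here is an assumption for illustration: the Account and CA field names, the DistributedHistory object, the local master, and in particular getRecursiveList, which is only a stand-in (it splits the address on '/') for the function in the question. It has to be a self-contained function that does not touch the SparkSession or other DataFrames.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row types; the field names are assumptions, not from the question.
case class Account(id: Int, address: String, name: String)
case class CA(id: Int, address: String, name: String, history: String)

object DistributedHistory {
  // Stand-in for the question's getRecursiveList (assumption: the real one
  // only needs the address itself, not the SparkSession or other DataFrames).
  def getRecursiveList(address: String): List[String] =
    address.split('/').toList

  // Same per-row logic as the loop body in the question.
  def toCA(a: Account): CA = {
    val output = a.address :: getRecursiveList(a.address).reverse
    CA(a.id, a.address, a.name, output.toString)
  }

  // Runs toCA on the executors, partition by partition; nothing is collected.
  def transform(spark: SparkSession, accounts: Dataset[Account]): Dataset[CA] = {
    import spark.implicits._ // provides the Encoder[CA] that map needs
    accounts.map(a => toCA(a))
  }

  def main(args: Array[String]): Unit = {
    // Local master is for trying the sketch out; on a cluster it comes from spark-submit.
    val spark = SparkSession.builder()
      .appName("distributed-history")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val accounts = Seq(Account(1, "a/b", "first"), Account(2, "c", "second")).toDS()
    val result = transform(spark, accounts)
    result.show(truncate = false)

    spark.stop()
  }
}
```

To append the result to the Hive table without passing through the driver, something like result.write.mode("append").insertInto("user_tech.history") should work, assuming the session was built with Hive support, the table already exists, and the column order matches.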

Please follow the links below for a better understanding:
1. http://bytepadding.com/big-data/spark/spark-code-analysis/
2. http://bytepadding.com/big-data/spark/understanding-spark-through-map-reduce/
