I wrote this Scala code to do some work for every row of a Spark DataFrame. Basically, these are the steps I take:
1. Convert the DataFrame into an array.
2. Iterate through the array, perform the calculations, and collect the results in another array.
3. Convert the output array back into a DataFrame and write it to a Hive table.
It is in step 2 that I run into problems once I process a million records. Is there any way I can improve performance? FYI, I only convert the DataFrame into an array because, AFAIK, there is no way to iterate over a Spark DataFrame.
def getRows(ca: org.apache.spark.sql.DataFrame) = {
  // collect() pulls every row back to the driver, so the loop below
  // runs single-threaded on one machine.
  for (a <- ca.collect()) yield {
    val newAddress = a.getString(1)
    val output = newAddress :: getRecursiveList(newAddress).reverse
    CA(a.getInt(0),
       a.getString(1),
       a.getString(2),
       output.toString)
  }
}
val myArray = getRows(customerAccounts)
// myArray lives on the driver; parallelize it back out into a DataFrame.
val outputDataFrame = sc.parallelize(myArray).toDF
outputDataFrame.show()
outputDataFrame.registerTempTable("history")
spark.sql(""" insert into user_tech.history select * from history """)
Answer 0 (score: 0):
First, a few fundamentals to understand:
ca.collect() gathers the DataFrame's rows from every node and hands them all to the driver for processing. That turns a distributed job into a single-machine loop, which is why it struggles at a million records; it is not a scalable solution.
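To make that concrete, here is a minimal sketch of the scalable alternative: push the per-row work into a UDF so it runs in parallel on the executors instead of in a driver-side loop. This is a sketch under assumptions, not the answer's verbatim code: it assumes getRecursiveList is a pure, serializable function, and it reads the address from the second column, mirroring the a.getString(1) access in the question.

import org.apache.spark.sql.functions.{col, udf}

// Assumption: getRecursiveList can be serialized and shipped to the executors.
val expandAddress = udf { (address: String) =>
  (address :: getRecursiveList(address).reverse).toString
}

// Compute the extra column in parallel on the executors; nothing is
// collected to the driver.
val outputDF = ca.withColumn("output", expandAddress(col(ca.columns(1))))

// Write straight into the Hive table. insertInto matches columns by
// position, so outputDF's schema must line up with user_tech.history.
outputDF.write.mode("append").insertInto("user_tech.history")

With this shape there is no intermediate array at all: the recursion, the column construction, and the Hive write all stay inside Spark's execution plan.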
Please go through the links below for a better understanding: