Please help me with the following situations in my Databricks PySpark 2.4.3 code, which processes TB-scale JSON data read from an AWS S3 bucket.
Situation 1:
Currently I read multiple tables one after another to get the per-day counts from each table.
For example, one column of table_list.csv holds the table names:
year = 2019
month = 12
tablesDF = spark.read.format("csv").option("header", "false").load("s3a://bucket//source/table_list.csv")
tabList = tablesDF.toPandas().values.tolist()
for table in tabList:
    tab_name = table[0]
    # Snowflake connection options, used for the Snowflake-side count()
    sfOptions = {
        "sfURL": "",
        "sfAccount": "",
        "sfUser": "",
        "sfPassword": "",
        "sfDatabase": "",
        "sfSchema": "",
        "sfWarehouse": "",
    }
    # Read the Snowflake count as a DataFrame
    sfxdf = (spark.read
        .format("snowflake")
        .options(**sfOptions)
        .option("query", "select y as year, m as month, count(*) as sCount from "
                "{} where y={} and m={} group by year, month".format(tab_name, year, month))
        .load())
    # Databricks Delta Lake count
    dbxDF = spark.sql("select y as year, m as month, count(*) as dCount from "
        "db.{} where y={} and m={} group by year, month".format(tab_name, year, month))
    resultDF = dbxDF.join(sfxdf, on=['year', 'month'], how='left_outer'
        ).na.fill(0).withColumn("flag_col", expr("dCount == sCount"))
    finalDF = resultDF.withColumn("table_name",
        lit(tab_name)).select("table_name", "year", "month", "dCount", "sCount", "flag_col")
    # append, otherwise the save fails after the first table because the path already exists
    finalDF.coalesce(1).write.mode('append').format('csv').option('header', 'true').save("s3a://outputs/reportcsv")
Questions:
1) Currently the run is sequential: I take the tables one at a time and issue the count queries one after another.
2) How can I read all the table names from the CSV file, run the count queries in parallel, and distribute the jobs across the whole cluster?
3) Could you show me how to optimize the code above in PySpark so that all the count queries run concurrently via multithreading?
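For context on what I have tried to reason about: since Spark actions submitted from separate driver threads are scheduled as concurrent jobs, a driver-side thread pool seems like one way to overlap the per-table checks. Below is a minimal sketch of that idea; `run_table_check` is a hypothetical stand-in for the per-table body above (the Snowflake read, the Delta count, the join, and the write), and `max_workers` is an assumed tuning knob, not something from my current code.

```python
from concurrent.futures import ThreadPoolExecutor

def run_table_check(tab_name):
    # Placeholder: in the real job this would run
    #   sfxdf = spark.read.format("snowflake")...  (Snowflake count)
    #   dbxDF = spark.sql(...)                     (Delta Lake count)
    #   resultDF = dbxDF.join(sfxdf, ...)          (compare + write)
    # Each thread triggers its own Spark jobs, which the scheduler
    # can run concurrently across the cluster.
    return (tab_name, "done")

def run_all(table_names, max_workers=8):
    # max_workers bounds how many count queries are in flight at once;
    # tune it to the cluster's capacity.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of table_names
        return list(pool.map(run_table_check, table_names))

results = run_all(["tab_a", "tab_b", "tab_c"])
```

Is this thread-pool pattern the right direction for question 3, or is there a more idiomatic Spark-native way to fan out the per-table queries?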
Thanks, Anbu