PySpark multiprocessing approach or code optimization

Asked: 2020-01-23 12:03:45

Tags: python pyspark pyspark-sql pyspark-dataframes

Could you please help me with the following situation in my Databricks PySpark 2.4.3 code, which processes terabytes of JSON data read from an AWS S3 bucket?

Case 1:

Currently, I am reading multiple tables one by one, sequentially, to get the daily counts from each table.

For example: table_list.csv has a single column containing multiple table names.

year = 2019
month = 12

from pyspark.sql.functions import expr, lit

tablesDF = spark.read.format("csv").option("header", "false").load("s3a://bucket//source/table_list.csv")
tabList = tablesDF.toPandas().values.tolist()

for table in tabList:
    tab_name = table[0]

    # Snowflake settings and Snowflake table count()
    sfOptions = {
        "sfURL": "",
        "sfAccount": "",
        "sfUser": "",
        "sfPassword": "",
        "sfDatabase": "",
        "sfSchema": "",
        "sfWarehouse": "",
    }

    # Read data as a dataframe
    sf_query = ("select y as year, m as month, count(*) as sCount "
                "from {} where y={} and m={} group by year, month").format(tab_name, year, month)
    sfxdf = (spark.read
             .format("snowflake")
             .options(**sfOptions)
             .option("query", sf_query)
             .load())

    # Databricks Delta Lake
    dbxDF = spark.sql(
        ("select y as year, m as month, count(*) as dCount "
         "from db.{} where y={} and m={} group by year, month").format(tab_name, year, month))

    resultDF = (dbxDF.join(sfxdf, on=['year', 'month'], how='left_outer')
                .na.fill(0)
                .withColumn("flag_col", expr("dCount == sCount")))

    finalDF = resultDF.withColumn("table_name", lit(tab_name)) \
        .select("table_name", "year", "month", "dCount", "sCount", "flag_col")

    finalDF.coalesce(1).write.format('csv').option('header', 'true') \
        .mode('append').save("s3a://outputs/reportcsv")

Questions:

1) Currently, I am using count queries in a sequential run, taking the tables one at a time.

2) How can I read all the tables from the CSV file in parallel, run the count queries in parallel, and distribute the jobs across the cluster?

3) Could you tell me how to optimize the above code in PySpark so that all the count queries are multithreaded and run concurrently?
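One common pattern for questions 2 and 3 is to submit each table's work to a Python thread pool: the shared SparkSession is safe to use from multiple driver threads, and each thread's actions become independent Spark jobs that the scheduler can run concurrently. The sketch below shows only the threading skeleton; `check_table`, `table_names`, and the worker count are hypothetical placeholders, and the stub body would be replaced by the per-table Snowflake/Delta logic from the code above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_table(tab_name, year=2019, month=12):
    # Placeholder for the per-table work: run the Snowflake count query,
    # the Delta Lake count query, join them, and write the report.
    # Replace this body with the Spark logic from the question.
    return (tab_name, year, month)

# Hypothetical table names; in practice, take them from tabList.
table_names = ["orders", "customers", "events"]

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    # Submit one task per table; tasks run concurrently on driver threads.
    futures = [pool.submit(check_table, t) for t in table_names]
    for f in as_completed(futures):
        results.append(f.result())
```

Note that concurrency here is limited by cluster resources; enabling the FAIR scheduler (`spark.scheduler.mode=FAIR`) can help the concurrent jobs share executors instead of queuing FIFO.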

Thanks, Anbu

0 answers:

There are no answers yet.