I am using Spark 2.4.0 on an AWS cluster for ETL, heavily based on Spark SQL through pyspark. I have a number of Python scripts that are invoked in sequence and have data dependencies between them: a main.py invokes other scripts such as process1.py, process2.py, etc.
The invocation is done using:
import subprocess
from subprocess import PIPE

# invoking process 1
command1 = "/anaconda3/bin/python3 process1.py"
p1 = subprocess.Popen(command1.split(" "), stdout=PIPE, stderr=subprocess.STDOUT)
p1.wait()

# invoking process 2
command2 = "/anaconda3/bin/python3 process2.py"
p2 = subprocess.Popen(command2.split(" "), stdout=PIPE, stderr=subprocess.STDOUT)
p2.wait()
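In full, main.py just runs the scripts one after another; a simplified sketch of it (the script list and the error handling here are only illustrative):

# main.py -- simplified sketch of the sequential invocation
import subprocess

PYTHON = "/anaconda3/bin/python3"
scripts = ["process1.py", "process2.py"]  # must run in this order (data dependencies)

for script in scripts:
    p = subprocess.Popen([PYTHON, script],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = p.communicate()  # wait for this step before starting the next
    if p.returncode != 0:
        raise RuntimeError("{} failed:\n{}".format(script, out.decode()))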
Each of these processes (process1.py, process2.py, etc.) does dataframe transformations using SQL-based syntax like:
df_1.createGlobalTempView('table_1')
# global temp views are registered in the global_temp database
result_1 = spark.sql('select * from global_temp.table_1 where <some conditions>')
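As far as I understand, a global temp view is tied to the Spark application that created it: other SparkSessions inside the same application can query it through the global_temp database, but a separate application cannot. Continuing the snippet above:

# Inside the same Spark application another session still sees the view,
# because global temp views live in the shared global_temp database.
other_session = spark.newSession()
other_session.sql('select * from global_temp.table_1').show()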
The challenge is that I want dataframes (like df_1 or result_1) and/or tables (like table_1) to be accessible across the processing sequence. So, for example, if the code above is in process1.py, the generated df_1 or table_1 should be accessible in process2.py.
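To make that concrete, this is roughly what I would like process2.py to be able to do after process1.py has finished (a sketch only; the view name comes from the snippet above). As far as I can tell it does not work today, because each python process ends up building its own SparkSession/application:

# process2.py -- desired access pattern
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-spark").getOrCreate()

# Ideally this would resolve the global temp view registered by process1.py
result_2 = spark.sql('select * from global_temp.table_1')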
main.py, process1.py and process2.py each get their Spark session using:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example-spark")
         .config("spark.sql.crossJoin.enabled", "true")
         .getOrCreate())
I know there is the option to use Hive for storing table_1, but I am trying to avoid that scenario if possible.
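For reference, the Hive-backed option I am trying to avoid would look roughly like this (a sketch, assuming a Hive-enabled session and the df_1 from the snippet above):

# process1.py -- persist the table to the Hive metastore / warehouse
spark = (SparkSession.builder
         .appName("example-spark")
         .enableHiveSupport()
         .getOrCreate())
df_1.write.mode("overwrite").saveAsTable("table_1")

# process2.py -- read it back, even from a separate application
result_1 = spark.table("table_1")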
Thanks a lot for your help!