I am using Spark 2.4.0 on an AWS cluster for ETL, heavily based on Spark SQL through pyspark. I have a number of Python scripts that are invoked in sequence and have data dependencies between them: a main.py invokes other scripts such as process1.py, process2.py, etc.
The invocation is done using:
import subprocess
from subprocess import PIPE

# invoking process 1
command1 = "/anaconda3/bin/python3 process1.py"
p1 = subprocess.Popen(command1.split(" "), stdout=PIPE, stderr=subprocess.STDOUT)
p1.wait()

# invoking process 2
command2 = "/anaconda3/bin/python3 process2.py"
p2 = subprocess.Popen(command2.split(" "), stdout=PIPE, stderr=subprocess.STDOUT)
p2.wait()
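In full, main.py just runs the scripts one after another; a simplified sketch of it (the script list and the error handling here are only illustrative):

# main.py -- simplified sketch of the sequential invocation
import subprocess

PYTHON = "/anaconda3/bin/python3"
scripts = ["process1.py", "process2.py"]  # must run in this order (data dependencies)

for script in scripts:
    p = subprocess.Popen([PYTHON, script],
                         stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = p.communicate()  # wait for this step before starting the next
    if p.returncode != 0:
        raise RuntimeError("{} failed:\n{}".format(script, out.decode()))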
Each of these processes (process1.py, process2.py, etc.) does dataframe transformations using SQL-based syntax like:
df_1.createGlobalTempView('table_1')
# global temp views are registered in the global_temp database
result_1 = spark.sql('select * from global_temp.table_1 where <some conditions>')
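As far as I understand, a global temp view is tied to the Spark application that created it: other SparkSessions inside the same application can query it through the global_temp database, but a separate application cannot. Continuing the snippet above:

# Inside the same Spark application another session still sees the view,
# because global temp views live in the shared global_temp database.
other_session = spark.newSession()
other_session.sql('select * from global_temp.table_1').show()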
The challenge is that I want dataframes (like df_1 or result_1) and/or tables (like table_1) to be accessible across the processing sequence. So, for example, if the code above is in process1.py, the generated df_1 or table_1 should be accessible in process2.py.
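To make that concrete, this is roughly what I would like process2.py to be able to do after process1.py has finished (a sketch only; the view name comes from the snippet above). As far as I can tell it does not work today, because each python process ends up building its own SparkSession/application:

# process2.py -- desired access pattern
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example-spark").getOrCreate()

# Ideally this would resolve the global temp view registered by process1.py
result_2 = spark.sql('select * from global_temp.table_1')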
main.py, process1.py and process2.py each get their Spark session using:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example-spark")
         .config("spark.sql.crossJoin.enabled", "true")
         .getOrCreate())
I know there is the option to use Hive for storing table_1, but I am trying to avoid that scenario if possible.
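For reference, the Hive-backed option I am trying to avoid would look roughly like this (a sketch, assuming a Hive-enabled session and the df_1 from the snippet above):

# process1.py -- persist the table to the Hive metastore / warehouse
spark = (SparkSession.builder
         .appName("example-spark")
         .enableHiveSupport()
         .getOrCreate())
df_1.write.mode("overwrite").saveAsTable("table_1")

# process2.py -- read it back, even from a separate application
result_1 = spark.table("table_1")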
Thanks a lot for your help!