I have multiple (4) Spark DataFrames (each with 10 million rows). I need to join these DataFrames on multiple columns (4 columns). I currently use the code below, but it is very slow:
sql = SQLContext(sc)
df1 = sql.read.format("org.apache.spark.sql.cassandra").\
load(keyspace="db", table="table1").\
select('time','regionid','wilayaid','siteid','col1', 'col2','col3','col4','col5')
df2 = sql.read.format("org.apache.spark.sql.cassandra").\
load(keyspace="db", table="table2").\
select('time','regionid','wilayaid','siteid','col6', 'col7','col8','col9')
df3 = sql.read.format("org.apache.spark.sql.cassandra").\
load(keyspace="db", table="table3").\
select('time','regionid','wilayaid','siteid','col10')
df4 = sql.read.format("org.apache.spark.sql.cassandra").\
load(keyspace="db", table="table4").\
select('time','regionid','wilayaid','siteid','col12')
join1 = df1.join(df2,['time','regionid','wilayaid','siteid'])
join2 = join1.join(df3,['time','regionid','wilayaid','siteid'])
join3 = join2.join(df4,['time','regionid','wilayaid','siteid'])
join3.show()
Any suggestions for speeding up this code?