How to perform multiple DataFrame joins in Spark?

Asked: 2018-04-29 22:09:00

Tags: apache-spark dataframe join pyspark

I have multiple (4) Spark DataFrames, each with about 10 million rows, and I need to join them on multiple columns (4 columns). I am currently using the code below, but it is very slow.

from pyspark.sql import SQLContext

sql = SQLContext(sc)

# Read only the needed columns from each Cassandra table.
df1 = sql.read.format("org.apache.spark.sql.cassandra").\
             load(keyspace="db", table="table1").\
             select('time','regionid','wilayaid','siteid','col1', 'col2','col3','col4','col5')
df2 = sql.read.format("org.apache.spark.sql.cassandra").\
             load(keyspace="db", table="table2").\
             select('time','regionid','wilayaid','siteid','col6', 'col7','col8','col9')
df3 = sql.read.format("org.apache.spark.sql.cassandra").\
             load(keyspace="db", table="table3").\
             select('time','regionid','wilayaid','siteid','col10')
df4 = sql.read.format("org.apache.spark.sql.cassandra").\
             load(keyspace="db", table="table4").\
             select('time','regionid','wilayaid','siteid','col12')

# Inner-join the four DataFrames on the shared key columns.
join1 = df1.join(df2, ['time','regionid','wilayaid','siteid'])
join2 = join1.join(df3, ['time','regionid','wilayaid','siteid'])
join3 = join2.join(df4, ['time','regionid','wilayaid','siteid'])
join3.show()
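As an aside, the three explicit joins can be folded into a single expression with Python's functools.reduce; this is a behaviorally equivalent sketch using the DataFrames defined above:

from functools import reduce

keys = ['time', 'regionid', 'wilayaid', 'siteid']

# Fold the list of DataFrames into one inner-joined result,
# equivalent to the three chained joins above.
joined = reduce(lambda left, right: left.join(right, keys), [df1, df2, df3, df4])
joined.show()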

Any suggestions for speeding up this code?
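One technique that often helps with repeated multi-key joins is to hash-partition every DataFrame on the join keys before joining, so consecutive joins can reuse the same partitioning instead of reshuffling at each step. A minimal sketch under that assumption, chaining the joins with reduce as above; the partition count of 200 is an illustrative tuning knob, not a recommendation:

from functools import reduce

keys = ['time', 'regionid', 'wilayaid', 'siteid']

# Repartition each input by the join keys so the chained joins
# can reuse the same hash partitioning.
dfs = [df.repartition(200, *keys) for df in [df1, df2, df3, df4]]

# Chain the inner joins over the co-partitioned inputs.
result = reduce(lambda left, right: left.join(right, keys), dfs)
result.show()

If any of the tables is small enough to fit in executor memory, wrapping it in pyspark.sql.functions.broadcast hints Spark to broadcast it and skip the shuffle for that join entirely.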

0 Answers:

No answers yet.