Caching a PySpark DataFrame does not improve performance

Time: 2020-04-16 07:04:13

Tags: apache-spark pyspark pyspark-sql

I'm trying to make my scripts more efficient.

At the moment I have 10 scripts, and each one reads, processes and outputs data.

They all read from the same main database tables and simply do different things with the data.

So I've merged them into a single script, thinking the data would be read once instead of 10 times.

Shouldn't that result in faster execution? Because it doesn't.
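My understanding (which may be wrong) is that cache() is lazy: nothing is actually stored until the first action runs on the DataFrame, and only later uses hit the in-memory copy. A minimal sketch of the behaviour I was expecting, with made-up table and column names:

import pyspark.sql.functions as sqlfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cache sketch').getOrCreate()

# cache() is lazy: it only marks the DataFrame for caching
logs = spark.sql("select * from some_db.some_table where dt = '2020-04-15'").cache()

# the first action scans the source table and fills the cache
logs.count()

# downstream jobs should then read from the cached data instead of the source table
daily_totals = logs.groupBy('field1').agg(sqlfunc.sum('field3').alias('bytesdl'))
daily_totals.show()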

Below is an example of the structure I'm using.

Any help would be great.

Thanks

'''
TABLE DEFINITION AND CACHING
'''
import pyspark.sql.functions as sqlfunc  # aggregate functions used below

# create_session is a local helper that builds and returns the SparkSession
spark_session = create_session('usage CF')

# source tables are read once and cached so every downstream job can reuse them
usage_logs = spark_session.sql("Select * from db.table where dt = " + yday_date).cache()
user_logs = spark_session.sql("Select * from db2.table2 where dt = " + yday_date).cache()
usercat = spark_session.sql("Select * from db3.table3 where dt = " + yday_date).cache()
radius_logs = spark_session.sql("Select * from db.table4 where dt = " + yday_date)
radius = radius_logs.select('emsisdn2', 'sessionid2', 'custavp1').cache()


'''
usage CF
'''
usage = usage_logs.select('field1', 'field2', 'field3')
conditions = [usage.sid == radius.sessionid2]
df3 = usage.join(radius, conditions, how='left')
df4 = df3.groupBy('field1', 'field2').agg(sqlfunc.sum('field3').alias('bytesdl'))
# createOrReplaceTempView returns None; it just registers the view for the SQL below
df4.createOrReplaceTempView('usage')
usage_table_output = spark_session.sql(' insert overwrite table outputdb.outputtbl partition(dt = ' + yday_date + ') select "usage" as type, * from usage ')

'''
user CF
'''
user = usage_logs.filter((usage_logs.vslsessid == '0')).select('field1', 'field2', 'field3', 'field4')
conditionsx = [user.sessionid == radius.sessionid2]
user_joined = user.join(radius, conditionsx, how='left')
user_output = user_joined.groupBy('field1', 'field2', 'field3').agg(sqlfunc.sum('field4').alias('bytesdl'))
user_output.createOrReplaceTempView('user')
user_table_output = spark_session.sql(' insert overwrite table outputdb.outputtbl2 partition(dt = ' + yday_date + ') select "user" as type, * from user')
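Is there something I should be doing to make sure the cache is actually used, for example forcing it with an action and then checking the storage level, as in this sketch (assuming Spark 2.x, where DataFrames expose a storageLevel property)?

# run an action on each cached DataFrame so the cache is filled before the two jobs above reuse it
usage_logs.count()
radius.count()

# storageLevel reports useMemory=True for DataFrames marked with cache()
print(usage_logs.storageLevel)
print(radius.storageLevel)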

0 Answers:

There are no answers