Optimizing Apache Spark SQL queries

Asked: 2017-08-25 15:51:02

Tags: performance apache-spark pyspark-sql

I am facing very long delays on Apache Spark when running some SQL queries. To simplify the queries, I run the computation sequentially: the output of each query is stored as a temporary table (.registerTempTable('TEMP')) so that it can be used in the following SQL query, and so on. But the queries take far too much time, whereas the same computation in "pure Python" code takes only a few minutes.



sqlContext.sql("""
SELECT PFMT.* , 
DICO_SITES.CodeAPI
FROM PFMT 
INNER JOIN DICO_SITES
ON PFMT.assembly_department = DICO_SITES.CodeProg """).registerTempTable("PFMT_API_CODE")

sqlContext.sql(""" 
SELECT GAMMA.*, 
(GAMMA.VOLUME*GAMMA.PRORATA)/100 AS VOLUME_PER_SUPPLIER
FROM
(SELECT PFMT_API_CODE.* , 
SUPPLIERS_PROP.CODE_SITE_FOURNISSEUR,
SUPPLIERS_PROP.PRORATA 
FROM PFMT_API_CODE 
INNER JOIN SUPPLIERS_PROP ON PFMT_API_CODE.reference = SUPPLIERS_PROP.PIE_NUMERO 
AND PFMT_API_CODE.project_code = SUPPLIERS_PROP.FAM_CODE 
AND PFMT_API_CODE.CodeAPI = SUPPLIERS_PROP.SITE_UTILISATION_FINAL) GAMMA """).registerTempTable("TEMP_ONE")

sqlContext.sql("""
SELECT TEMP_ONE.* , 
ADCP_DATA.* , 
CASE 
WHEN  ADCP_DATA.WEEK  <= weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_ST + ADCP_DATA.ADD_CAPACITY_ST
WHEN  ADCP_DATA.WEEK  > weekofyear(from_unixtime(unix_timestamp())) + 24 THEN ADCP_DATA.CAPACITY_LT + ADCP_DATA.ADD_CAPACITY_LT
END AS CAPACITY_REF
FROM TEMP_ONE
INNER JOIN ADCP_DATA
ON TEMP_ONE.reference = ADCP_DATA.PART_NUMBER
AND TEMP_ONE.CodeAPI = ADCP_DATA.API_CODE
AND TEMP_ONE.project_code = ADCP_DATA.PROJECT_CODE
AND TEMP_ONE.CODE_SITE_FOURNISSEUR = ADCP_DATA.SUPPLIER_SITE_CODE
AND TEMP_ONE.WEEK_NUM = ADCP_DATA.WEEK_NUM
""" ).registerTempTable('TEMP_BIS')

sqlContext.sql("""
SELECT TEMP_BIS.CSF_ID, 
TEMP_BIS.CF_ID ,
TEMP_BIS.CAPACITY_REF, 
TEMP_BIS.VOLUME_PER_SUPPLIER, 
CASE 
WHEN TEMP_BIS.CAPACITY_REF >= VOLUME_PER_SUPPLIER THEN 'CAPACITY_OK'
WHEN TEMP_BIS.CAPACITY_REF < VOLUME_PER_SUPPLIER THEN 'CAPACITY_NOK'
END AS CAPACITY_CHECK
FROM TEMP_BIS
""").take(100)

Could someone point out best practices (if any) for writing PySpark SQL queries like these? Does it make sense that the script runs much faster locally on my machine than on the Hadoop cluster? Thanks in advance.

1 answer:

Answer 0 (score: 0)

You should cache the intermediate results. Also, what is your data source? Can you retrieve only the relevant data from it, or only the relevant columns? There are many options here; please provide more information about the data.
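As a rough illustration of that advice, here is a minimal sketch assuming the input tables are read from Parquet files; the question does not say where PFMT, DICO_SITES, etc. actually come from, so the paths and the pruned column lists below are hypothetical. It reads only the columns later queries need and caches the first intermediate table so the downstream joins do not recompute it:

# Read only the columns the later queries actually use (hypothetical paths).
pfmt = sqlContext.read.parquet("/data/pfmt") \
    .select("reference", "project_code", "assembly_department", "VOLUME", "WEEK_NUM")
pfmt.registerTempTable("PFMT")

dico_sites = sqlContext.read.parquet("/data/dico_sites") \
    .select("CodeProg", "CodeAPI")
dico_sites.registerTempTable("DICO_SITES")

# Materialize and cache the intermediate join once, so every later query that
# refers to PFMT_API_CODE reuses the cached data instead of recomputing it.
sqlContext.sql("""
SELECT PFMT.*, DICO_SITES.CodeAPI
FROM PFMT
INNER JOIN DICO_SITES
  ON PFMT.assembly_department = DICO_SITES.CodeProg
""").registerTempTable("PFMT_API_CODE")
sqlContext.cacheTable("PFMT_API_CODE")

# The cache is only populated the first time the table is scanned; a cheap
# action such as a count forces materialization up front.
sqlContext.sql("SELECT COUNT(*) FROM PFMT_API_CODE").collect()

The same pattern (registerTempTable followed by cacheTable, or keeping the DataFrame in a variable and calling .cache() on it) can be applied to TEMP_ONE and TEMP_BIS as well, since each of them is reused by the query that follows it.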