I am using SparkSQL to build a fact table from 5 dimension tables. I am facing a performance problem (the job takes several hours to complete) and, even after exhaustive googling, I cannot find a solution. These are the settings I have tried, without success:
sqlContext.sql("set spark.sql.shuffle.partitions=10"); // varied between 10 and 5000
sqlContext.sql("set spark.sql.autoBroadcastJoinThreshold=500000000"); // 500 MB, tried 1 GB also
I suspect a data skew problem, because of the task and record distribution I am seeing.
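For reference, this is roughly how the per-partition distribution can be checked from code rather than from the UI (a sketch I have not run; `resultDmn5` is the large dimension DataFrame from the code below):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Row;

// Count the rows in each partition; a skewed table shows one huge count
// next to many small ones (or a single partition holding everything).
List<Integer> rowsPerPartition = resultDmn5.javaRDD().mapPartitions(
        new FlatMapFunction<Iterator<Row>, Integer>() {
            @Override
            public Iterable<Integer> call(Iterator<Row> rows) {
                int n = 0;
                while (rows.hasNext()) { rows.next(); n++; }
                return Collections.singletonList(n);
            }
        }).collect();
System.out.println(rowsPerPartition);
```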
Most of the RDDs are well partitioned (500 partitions each), but the largest dimension is not partitioned at all (see the images). Could that point to a solution? Below is the code I use to compute the dimensions and the fact table.
```java
resultDmn1.registerTempTable("Dmn1");
resultDmn2.registerTempTable("Dmn2");
resultDmn3.registerTempTable("Dmn3");
resultDmn4.registerTempTable("Dmn4");
resultDmn5.registerTempTable("Dmn5");
DataFrame resultFact = sqlContext.sql("SELECT DISTINCT\n" +
" 0 AS FactId,\n" +
" rs.c28 AS c28,\n" +
" dop.DmnId AS dmn_id_dim4,\n" +
" dh.DmnId AS dmn_id_dim5,\n" +
" op.DmnId AS dmn_id_dim3,\n" +
" du.DmnId AS dmn_id_dim2,\n" +
" dc.DmnId AS dmn_id_dim1\n" +
"FROM\n" +
" t10 rs\n" +
" JOIN\n" +
" t11 r ON rs.c29 = r.id\n" +
" JOIN\n" +
" Dmn4 dop ON dop.c26 = r.c25\n" +
" JOIN\n" +
" Dmn5 dh ON dh.Date = r.c27\n" +
" JOIN\n" +
" Dmn3 du ON du.c9 = r.c16\n" +
" JOIN\n" +
" t1 d ON r.c5 = d.id\n" +
" JOIN\n" +
" t2 di ON d.id = di.c5\n" +
" JOIN\n" +
" t3 s ON d.c6 = s.id\n" +
" JOIN\n" +
" t4 p ON s.c7 = p.id\n" +
" JOIN\n" +
" t5 o ON p.c8 = o.id\n" +
" JOIN\n" +
" Dmn1 op ON op.c1 = di.c1\n" +
" JOIN\n" +
" t9 ci ON ci.id = r.c24\n" +
" JOIN\n" +
" Dmn3 dc ON dc.c18 = ci.c23\n" +
"WHERE\n" +
" op.c2 = di.c2\n" +
" AND o.name = op.c30\n" +
" AND di.c3 = op.c3\n" +
" AND di.c4 = op.c4").toSchemaRDD();
resultFact.count(); // action that triggers the full multi-way join; this is the step that takes hours
resultFact.cache(); // note: marked for caching only after the count has already executed
```

Before the computation, Dmn1 has 56 rows, Dmn2 11, Dmn3 10, Dmn4 12, and Dmn5 1,275,533 rows. Everything runs on an AWS EMR cluster with 3 m3.2xlarge nodes (master + 2 slaves).
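Given that Dmn5 is by far the largest table and appears to sit in a single partition, I am wondering whether explicitly repartitioning it before registering the temp table would spread the join work out. A minimal untested sketch (the 500 is an assumption, chosen to match the partition count of the other RDDs):

```java
// Spread the big dimension across the cluster before the join;
// 500 matches the partitioning of the well-behaved RDDs mentioned above.
DataFrame dmn5Spread = resultDmn5.repartition(500);
dmn5Spread.registerTempTable("Dmn5");
```

Would that be the right direction, or is there a better way to handle the skew here?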