
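I am translating PySpark code to Scala Spark because I expected Scala Spark to perform better. However, Scala Spark is taking more time than PySpark. Has anyone run into this issue when executing the following two queries in Scala Spark?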

Query1:
sqlContext.sql("""
  SELECT a.pair, a.bi_count, a.uni_count,
         unigram_table.uni_count AS uni_count_2,
         (log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)) AS score
  FROM (SELECT * FROM bigram_table
        JOIN unigram_table ON bigram_table.parent = unigram_table.token) AS a
  JOIN unigram_table ON a.child = unigram_table.token
  WHERE a.bi_count > 4000
  ORDER BY score DESC
  LIMIT 400000
""")

Execution time in PySpark: 3 minutes

Execution time in Scala Spark: 3 minutes
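
For readers unfamiliar with the score expression in Query1, here is a minimal plain-Python sketch of what it computes per row. The function name `bigram_score` and the sample rows are made up for illustration only. The sketch mimics the WHERE filter and the ORDER BY on a tiny in-memory list and says nothing about how Spark actually plans or executes the joins. It assumes the single-argument `log` in Spark SQL is the natural logarithm, which `math.log` matches.

```python
import math

def bigram_score(bi_count, uni_count, uni_count_2):
    # Mirrors the SQL expression:
    # log(a.bi_count) - log(a.uni_count) - log(unigram_table.uni_count)
    return math.log(bi_count) - math.log(uni_count) - math.log(uni_count_2)

# Hypothetical rows: (pair, bi_count, uni_count, uni_count_2)
rows = [
    ("new york", 5000, 20000, 8000),
    ("of the", 90000, 500000, 900000),
    ("rare pair", 100, 300, 400),
]

# WHERE a.bi_count > 4000
kept = [r for r in rows if r[1] > 4000]

# ORDER BY score DESC
ranked = sorted(kept, key=lambda r: bigram_score(*r[1:]), reverse=True)

for pair, bi, u1, u2 in ranked:
    print(pair, round(bigram_score(bi, u1, u2), 3))
```

Because the score is a difference of logs, it equals the log of bi_count / (uni_count * uni_count_2), so pairs that co-occur often relative to their individual token counts rank highest.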

Query2:
sqlContext.sql("""
  SELECT pair, tri_count,
         (log(tri_count) - log(count1) - log(count2) - log(unigram_table.uni_count)) AS score
  FROM (SELECT pair, tri_count, count1, child1, child2,
               unigram_table.uni_count AS count2
        FROM (SELECT pair, child1, child2, tri_count,
                     unigram_table.uni_count AS count1
              FROM trigram_table
              JOIN unigram_table ON trigram_table.parent = unigram_table.token) AS a
        JOIN unigram_table ON a.child1 = unigram_table.token) AS b
  JOIN unigram_table ON b.child2 = unigram_table.token
  WHERE tri_count > 3000
  ORDER BY score DESC
""")

Execution time in PySpark: 3 minutes

Execution time in Scala Spark: 12 minutes

Is Scala Spark actually better than PySpark?

0 answers: