I'm currently working with the StackOverflow dataset from the Google BigQuery public datasets.
I want to find, for a given country and tag, the top users by number of questions/answers (and possibly also their question and answer scores, etc.).
I figured these queries would be faster if I first precomputed a dataset of user id, tag, numQuestions, numAnswers and so on.
So I did:
usersQuestionsDf = spark.sql("""
    -- per-user, per-tag question counts and total favorites
    SELECT uid, tag, COUNT(*) AS numQuestions, SUM(favorite_count) AS favs
    FROM (
        SELECT u.id AS uid, explode(q.tags) AS tag, q.favorite_count
        FROM usersFixedCountries u
        INNER JOIN questions q
            ON q.owner_user_id = u.id
        WHERE u.country IS NOT NULL
    )
    GROUP BY uid, tag
""")
usersAnswersDf = spark.sql("""
    -- per-user, per-tag answer counts and total answer score
    SELECT uid, tag, SUM(score) AS score, COUNT(*) AS numAnswers
    FROM (
        SELECT u.id AS uid, explode(q.tags) AS tag, a.score
        FROM usersFixedCountries u
        INNER JOIN answers a
            ON a.owner_user_id = u.id
        INNER JOIN questions q
            ON q.id = a.parent_id
        WHERE u.country IS NOT NULL
    )
    GROUP BY uid, tag
""")
Then I tried this:
usersAnswersDf.createOrReplaceTempView("usersAnswers")
usersQuestionsDf.createOrReplaceTempView("usersQuestions")

usersTagScoreDf = spark.sql("""
    -- combine the per-tag question and answer stats and attach user details
    SELECT q.uid, q.tag, numAnswers, score AS answersScore,
           numQuestions, favs AS questionsFavs,
           u.display_name, u.up_votes
    FROM usersAnswers a
    FULL OUTER JOIN usersQuestions q
        ON q.uid = a.uid
        AND q.tag = a.tag
    INNER JOIN usersFixedCountries u
        ON u.id = q.uid
    WHERE u.country IS NOT NULL
""")
The problem is that this last part throws an error. My guess is that it runs out of memory, so I'm wondering how I can optimize this. Maybe my queries are just inefficient? Here is the error:
An error occurred while calling o132.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(uid#350, 200)
+- *Project [score#391L, numAnswers#392L, uid#350, tag#355, numQuestions#352L, favs#353L]
+- *BroadcastHashJoin [uid#389, tag#394], [uid#350, tag#355], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, true], input[1, string, true]))
: +- InMemoryTableScan [uid#389, tag#394, score#391L, numAnswers#392L]
: +- InMemoryRelation [uid#389, tag#394, score#391L, numAnswers#392L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[sum(cast(score#33 as bigint)), count(1)], output=[uid#175, tag#180, score#177L, numAnswers#178L])
: +- Exchange hashpartitioning(uid#175, tag#180, 200)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[partial_sum(cast(score#33 as bigint)), partial_count(1)], output=[uid#175, tag#180, sum#189L, count#190L])
: +- *Project [id#82 AS uid#175, tag#180, score#33]
: +- Generate explode(tags#2), true, false, [tag#180]
: +- *Project [id#82, score#33, tags#2]
: +- *SortMergeJoin [parent_id#32], [id#0], Inner
: :- *Sort [parent_id#32 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(parent_id#32, 200)
: : +- *Project [id#82, parent_id#32, score#33]
: : +- *SortMergeJoin [id#82], [owner_user_id#31], Inner
: : :- *Sort [id#82 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#82, 200)
: : : +- *Project [id#82]
: : : +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: : : +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
: : +- *Sort [owner_user_id#31 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(owner_user_id#31, 200)
: : +- *Project [owner_user_id#31, parent_id#32, score#33]
: : +- *Filter (isnotnull(owner_user_id#31) && isnotnull(parent_id#32))
: : +- *FileScan parquet [owner_user_id#31,parent_id#32,score#33,creation_year#36] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/answers.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id), IsNotNull(parent_id)], ReadSchema: struct<owner_user_id:int,parent_id:int,score:int>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- *Project [id#0, tags#2]
: +- *Filter isnotnull(id#0)
: +- *FileScan parquet [id#0,tags#2,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,tags:array<string>>
+- *Filter isnotnull(uid#350)
+- InMemoryTableScan [uid#350, tag#355, numQuestions#352L, favs#353L], [isnotnull(uid#350)]
+- InMemoryRelation [uid#350, tag#355, numQuestions#352L, favs#353L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[count(1), sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, numQuestions#114L, favs#115L])
+- Exchange hashpartitioning(uid#112, tag#117, 200)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[partial_count(1), partial_sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, count#126L, sum#127L])
+- *Project [id#82 AS uid#112, tag#117, favorite_count#9]
+- Generate explode(tags#2), true, false, [tag#117]
+- *Project [id#82, tags#2, favorite_count#9]
+- *SortMergeJoin [id#82], [owner_user_id#3], Inner
:- *Sort [id#82 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#82, 200)
: +- *Project [id#82]
: +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
+- *Sort [owner_user_id#3 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(owner_user_id#3, 200)
+- *Project [tags#2, owner_user_id#3, favorite_count#9]
+- *Filter isnotnull(owner_user_id#3)
+- *FileScan parquet [tags#2,owner_user_id#3,favorite_count#9,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id)], ReadSchema: struct<tags:array<string>,owner_user_id:int,favorite_count:int>
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegenExec.scala:244)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.inputRDDs(SortMergeJoinExec.scala:377)
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:263)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:128)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:46)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:331)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:372)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 55 more
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o132.showString.
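From the "Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]" line, it looks like it is the BroadcastExchange of the usersAnswers side that times out (300 seconds is Spark's default broadcast timeout), rather than a plain out-of-memory error. One thing I'm considering, though I'm not sure it is the right fix, is to either raise that timeout or disable automatic broadcast joins so Spark falls back to a sort-merge join:

# raise the broadcast timeout from the default 300 seconds ...
spark.conf.set("spark.sql.broadcastTimeout", 1200)

# ... or stop Spark from broadcasting either side of the join at all,
# forcing a sort-merge join instead
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)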