I'm currently working with the StackOverflow dataset from the Google BigQuery public datasets.
I want to find, for a given country and tag, the top users by number of questions/answers (and possibly also their question and answer scores, etc.).
I figured these queries would be faster if I first precomputed a dataset of user id, tag, numQuestions, numAnswers and so on.
So I did:
usersQuestionsDf = spark.sql("""
    -- per-user, per-tag question counts and total favorites
    SELECT uid, tag, COUNT(*) AS numQuestions, SUM(favorite_count) AS favs
    FROM (
        SELECT u.id AS uid, explode(q.tags) AS tag, q.favorite_count
        FROM usersFixedCountries u
        INNER JOIN questions q
            ON q.owner_user_id = u.id
        WHERE u.country IS NOT NULL
    )
    GROUP BY uid, tag
""")
usersAnswersDf = spark.sql("""
    -- per-user, per-tag answer counts and total answer score
    SELECT uid, tag, SUM(score) AS score, COUNT(*) AS numAnswers
    FROM (
        SELECT u.id AS uid, explode(q.tags) AS tag, a.score
        FROM usersFixedCountries u
        INNER JOIN answers a
            ON a.owner_user_id = u.id
        INNER JOIN questions q
            ON q.id = a.parent_id
        WHERE u.country IS NOT NULL
    )
    GROUP BY uid, tag
""")
Then I tried this:
usersAnswersDf.createOrReplaceTempView("usersAnswers")
usersQuestionsDf.createOrReplaceTempView("usersQuestions")

usersTagScoreDf = spark.sql("""
    -- combine the per-tag question and answer stats and attach user details
    SELECT q.uid, q.tag, numAnswers, score AS answersScore,
           numQuestions, favs AS questionsFavs,
           u.display_name, u.up_votes
    FROM usersAnswers a
    FULL OUTER JOIN usersQuestions q
        ON q.uid = a.uid
        AND q.tag = a.tag
    INNER JOIN usersFixedCountries u
        ON u.id = q.uid
    WHERE u.country IS NOT NULL
""")
The problem is that this last part throws an error. My guess is that it runs out of memory, so I'm wondering how I can optimize this. Maybe my queries are just inefficient? Here is the error:
An error occurred while calling o132.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(uid#350, 200)
+- *Project [score#391L, numAnswers#392L, uid#350, tag#355, numQuestions#352L, favs#353L]
+- *BroadcastHashJoin [uid#389, tag#394], [uid#350, tag#355], RightOuter, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, int, true], input[1, string, true]))
: +- InMemoryTableScan [uid#389, tag#394, score#391L, numAnswers#392L]
: +- InMemoryRelation [uid#389, tag#394, score#391L, numAnswers#392L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[sum(cast(score#33 as bigint)), count(1)], output=[uid#175, tag#180, score#177L, numAnswers#178L])
: +- Exchange hashpartitioning(uid#175, tag#180, 200)
: +- *HashAggregate(keys=[uid#175, tag#180], functions=[partial_sum(cast(score#33 as bigint)), partial_count(1)], output=[uid#175, tag#180, sum#189L, count#190L])
: +- *Project [id#82 AS uid#175, tag#180, score#33]
: +- Generate explode(tags#2), true, false, [tag#180]
: +- *Project [id#82, score#33, tags#2]
: +- *SortMergeJoin [parent_id#32], [id#0], Inner
: :- *Sort [parent_id#32 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(parent_id#32, 200)
: : +- *Project [id#82, parent_id#32, score#33]
: : +- *SortMergeJoin [id#82], [owner_user_id#31], Inner
: : :- *Sort [id#82 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(id#82, 200)
: : : +- *Project [id#82]
: : : +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: : : +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
: : +- *Sort [owner_user_id#31 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(owner_user_id#31, 200)
: : +- *Project [owner_user_id#31, parent_id#32, score#33]
: : +- *Filter (isnotnull(owner_user_id#31) && isnotnull(parent_id#32))
: : +- *FileScan parquet [owner_user_id#31,parent_id#32,score#33,creation_year#36] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/answers.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id), IsNotNull(parent_id)], ReadSchema: struct<owner_user_id:int,parent_id:int,score:int>
: +- *Sort [id#0 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#0, 200)
: +- *Project [id#0, tags#2]
: +- *Filter isnotnull(id#0)
: +- *FileScan parquet [id#0,tags#2,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,tags:array<string>>
+- *Filter isnotnull(uid#350)
+- InMemoryTableScan [uid#350, tag#355, numQuestions#352L, favs#353L], [isnotnull(uid#350)]
+- InMemoryRelation [uid#350, tag#355, numQuestions#352L, favs#353L], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[count(1), sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, numQuestions#114L, favs#115L])
+- Exchange hashpartitioning(uid#112, tag#117, 200)
+- *HashAggregate(keys=[uid#112, tag#117], functions=[partial_count(1), partial_sum(cast(favorite_count#9 as bigint))], output=[uid#112, tag#117, count#126L, sum#127L])
+- *Project [id#82 AS uid#112, tag#117, favorite_count#9]
+- Generate explode(tags#2), true, false, [tag#117]
+- *Project [id#82, tags#2, favorite_count#9]
+- *SortMergeJoin [id#82], [owner_user_id#3], Inner
:- *Sort [id#82 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#82, 200)
: +- *Project [id#82]
: +- *Filter (isnotnull(country#89) && isnotnull(id#82))
: +- *FileScan parquet [id#82,country#89] Batched: true, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/usersFixedCountries.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(country), IsNotNull(id)], ReadSchema: struct<id:int,country:string>
+- *Sort [owner_user_id#3 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(owner_user_id#3, 200)
+- *Project [tags#2, owner_user_id#3, favorite_count#9]
+- *Filter isnotnull(owner_user_id#3)
+- *FileScan parquet [tags#2,owner_user_id#3,favorite_count#9,creation_year#12] Batched: false, Format: Parquet, Location: InMemoryFileIndex[wasb://data@cs4225.blob.core.windows.net/parquet/questions.parquet], PartitionCount: 11, PartitionFilters: [], PushedFilters: [IsNotNull(owner_user_id)], ReadSchema: struct<tags:array<string>,owner_user_id:int,favorite_count:int>
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.InputAdapter.doExecute(WholeStageCodegenExec.scala:244)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.joins.SortMergeJoinExec.inputRDDs(SortMergeJoinExec.scala:377)
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:242)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:83)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:263)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:128)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:77)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:38)
at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:46)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:331)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:372)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 55 more
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o132.showString.
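From the "Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]" line, it looks like it is the BroadcastExchange of the usersAnswers side that times out (300 seconds is Spark's default broadcast timeout), rather than a plain out-of-memory error. One thing I'm considering, though I'm not sure it is the right fix, is to either raise that timeout or disable automatic broadcast joins so Spark falls back to a sort-merge join:

# raise the broadcast timeout from the default 300 seconds ...
spark.conf.set("spark.sql.broadcastTimeout", 1200)

# ... or stop Spark from broadcasting either side of the join at all,
# forcing a sort-merge join instead
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)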