How can I get Spark to use the partitioning information in Parquet files?

Asked: 2016-02-11 23:07:35

Tags: python-3.x apache-spark pyspark parquet

I'm trying to precompute partitions for some Spark SQL queries. If I compute and persist the partitions, Spark uses them. But if I save the partitioned data to Parquet and reload it later, the partitioning information is gone and Spark repartitions the data from scratch.

The actual data is large enough that partitioning it takes a significant amount of time. The code below illustrates the problem. test2() is the only variant I have working so far, but I want the real processing to start quickly, which is what test3() tries to do.

Does anyone know what I'm doing wrong, or whether this is simply something Spark cannot do?

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# NOTE: Need to have python in PATH, SPARK_HOME set to location of spark, HADOOP_HOME set to location of winutils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonPartitionBug")
    sql_text = "select foo, bar from customer c, orders o where foo < 300 and c.custkey=o.custkey"

    def setup():
        sqlContext = SQLContext(sc)
        fields1 = [StructField(name, IntegerType()) for name in ['custkey', 'foo']]
        data1 = [(1, 110), (2, 210), (3, 310), (4, 410), (5, 510)]
        df1 = sqlContext.createDataFrame(data1, StructType(fields1))
        df1.persist()
        fields2 = [StructField(name, IntegerType()) for name in ['orderkey', 'custkey', 'bar']]
        data2 = [(1, 1, 10), (2, 1, 20), (3, 2, 30), (4, 3, 40), (5, 4, 50)]
        df2 = sqlContext.createDataFrame(data2, StructType(fields2))
        df2.persist()
        return sqlContext, df1, df2

    def test1():
        # Without repartition the final plan includes hashpartitioning
        # == Physical Plan ==
        # Project [foo#1,bar#14]
        # +- SortMergeJoin [custkey#0], [custkey#13]
        #    :- Sort [custkey#0 ASC], false, 0
        #    :  +- TungstenExchange hashpartitioning(custkey#0,200), None
        #    :     +- Filter (foo#1 < 300)
        #    :        +- InMemoryColumnarTableScan [custkey#0,foo#1], [(foo#1 < 300)], InMemoryRelation [custkey#0,foo#1], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
        #    +- Sort [custkey#13 ASC], false, 0
        #       +- TungstenExchange hashpartitioning(custkey#13,200), None
        #          +- InMemoryColumnarTableScan [bar#14,custkey#13], InMemoryRelation [orderkey#12,custkey#13,bar#14], true, 10000, StorageLevel(false, true, false, false, 1), ConvertToUnsafe, None
        sqlContext, df1, df2 = setup()
        df1.registerTempTable("customer")
        df2.registerTempTable("orders")
        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    def test2():
        # With repartition the final plan does not include hashpartitioning
        # == Physical Plan ==
        # Project [foo#56,bar#69]
        # +- SortMergeJoin [custkey#55], [custkey#68]
        #    :- Sort [custkey#55 ASC], false, 0
        #    :  +- Filter (foo#56 < 300)
        #    :     +- InMemoryColumnarTableScan [custkey#55,foo#56], [(foo#56 < 300)], InMemoryRelation [custkey#55,foo#56], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#55,4), None, None
        #    +- Sort [custkey#68 ASC], false, 0
        #       +- InMemoryColumnarTableScan [bar#69,custkey#68], InMemoryRelation [orderkey#67,custkey#68,bar#69], true, 10000, StorageLevel(false, true, false, false, 1), TungstenExchange hashpartitioning(custkey#68,4), None, None
        sqlContext, df1, df2 = setup()
        df1a = df1.repartition(4, 'custkey').persist()
        df1a.registerTempTable("customer")

        df2a = df2.repartition(4, 'custkey').persist()
        df2a.registerTempTable("orders")

        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    def test3():
        # After round tripping the partitioned data, the partitioning is lost and spark repartitions
        # == Physical Plan ==
        # Project [foo#223,bar#284]
        # +- SortMergeJoin [custkey#222], [custkey#283]
        #    :- Sort [custkey#222 ASC], false, 0
        #    :  +- TungstenExchange hashpartitioning(custkey#222,200), None
        #    :     +- Filter (foo#223 < 300)
        #    :        +- InMemoryColumnarTableScan [custkey#222,foo#223], [(foo#223 < 300)], InMemoryRelation [custkey#222,foo#223], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[custkey#222,foo#223] InputPaths: file:/E:/.../df1.parquet, None
        #    +- Sort [custkey#283 ASC], false, 0
        #       +- TungstenExchange hashpartitioning(custkey#283,200), None
        #          +- InMemoryColumnarTableScan [bar#284,custkey#283], InMemoryRelation [orderkey#282,custkey#283,bar#284], true, 10000, StorageLevel(false, true, false, false, 1), Scan ParquetRelation[orderkey#282,custkey#283,bar#284] InputPaths: file:/E:/.../df2.parquet, None
        sqlContext, df1, df2 = setup()
        df1a = df1.repartition(4, 'custkey').persist()
        df1a.write.parquet("df1.parquet", mode='overwrite')
        df1a = sqlContext.read.parquet("df1.parquet")
        df1a.persist()
        df1a.registerTempTable("customer")

        df2a = df2.repartition(4, 'custkey').persist()
        df2a.write.parquet("df2.parquet", mode='overwrite')
        df2a = sqlContext.read.parquet("df2.parquet")
        df2a.persist()
        df2a.registerTempTable("orders")

        df3 = sqlContext.sql(sql_text)
        df3.collect()
        df3.explain(True)

    test1()
    test2()
    test3()
    sc.stop()

2 Answers:

Answer 0 (score: 2)

You're not doing anything wrong, but what you're trying to achieve is not something Spark can do: the partitioner used when the data is saved is necessarily lost on write to disk. Why? Because Spark has no file format of its own, it relies on existing formats (Parquet, ORC, plain text files, and so on), none of which knows anything about Spark's internal partitioner, so none of them can persist that information. The data on disk is partitioned correctly, but Spark has no way of knowing which partitioner was used when it loads the data back, so it has no choice but to repartition.
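You can see this directly with the question's own setup(): the repartitioned DataFrame carries the hash partitioning in its physical plan, while the same data read back from Parquet does not. A minimal check (the comments refer to the Spark 1.6-style plans shown in the question):

sqlContext, df1, df2 = setup()
df1a = df1.repartition(4, 'custkey').persist()
df1a.explain()  # physical plan contains TungstenExchange hashpartitioning(custkey, 4)

df1a.write.parquet("df1.parquet", mode='overwrite')
df1b = sqlContext.read.parquet("df1.parquet")
df1b.explain()  # physical plan is just a Parquet scan; no partitioning information survives the round trip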

The reason test2() doesn't expose this is that you reuse the same DataFrame instances, which still hold the partitioning information (in memory).
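One workaround, not spelled out in the answer but consistent with it, is to accept a single shuffle per session: repartition right after reading the Parquet files and persist the result, so later queries reuse the in-memory partitioning instead of shuffling again. A sketch using the names and paths from the question's test3():

df1a = sqlContext.read.parquet("df1.parquet").repartition(4, 'custkey').persist()
df2a = sqlContext.read.parquet("df2.parquet").repartition(4, 'custkey').persist()
df1a.registerTempTable("customer")
df2a.registerTempTable("orders")
# The first action pays the shuffle once; subsequent joins on custkey reuse the
# in-memory partitioning and avoid the extra exchange, as in test2().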

Answer 1 (score: 1)

A better solution is to use persist(StorageLevel.MEMORY_AND_DISK), which spills RDD/DataFrame partitions to the workers' local disks if they are evicted from memory. In that case, rebuilding a partition only means reading it back from the worker's local disk, which is relatively fast.
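A minimal sketch of that suggestion, applied to the setup from the question's test2() (the import and wiring are assumptions about how this would be used, not code from the original post):

from pyspark import StorageLevel

df1a = df1.repartition(4, 'custkey').persist(StorageLevel.MEMORY_AND_DISK)
df2a = df2.repartition(4, 'custkey').persist(StorageLevel.MEMORY_AND_DISK)
df1a.registerTempTable("customer")
df2a.registerTempTable("orders")
# Partitions evicted from memory spill to the workers' local disks and are read
# back from there, instead of being recomputed (and reshuffled) from scratch.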