I created two external tables in spark-sql: one with file format parquet, the other with file format textfile.
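The tables were created with DDL along these lines (a sketch: the column list, partition column, and the test_p location are inferred from the plans below; the test_p3 location is illustrative):

CREATE EXTERNAL TABLE test_p (
  Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
  CustomerSuffix STRING, Location STRING, Mobile STRING,
  Occupation STRING, Salary STRING)
PARTITIONED BY (Country STRING)
STORED AS PARQUET
LOCATION 'file:/C:/dev/tests2';   -- path taken from PrunedInMemoryFileIndex below

CREATE EXTERNAL TABLE test_p3 (
  Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
  CustomerSuffix STRING, Location STRING, Mobile STRING,
  Occupation STRING, Salary STRING)
PARTITIONED BY (Country STRING)
STORED AS TEXTFILE
LOCATION 'file:/C:/dev/tests3';   -- hypothetical path; not shown in the question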
When I extract the query plans for these two tables, Spark handles them differently.
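The plans were captured with EXPLAIN EXTENDED; the query shape below is assumed from the parsed plan:

EXPLAIN EXTENDED SELECT * FROM test_p  WHERE country = 'Korea';
EXPLAIN EXTENDED SELECT * FROM test_p3 WHERE country = 'Korea';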
The query plan output for the parquet table is:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
+- 'UnresolvedRelation `test_p`
== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
+- SubqueryAlias test_p
+- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet
== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
+- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)
== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...
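Note the PartitionFilters and the PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea] location: with PartitionCount: 1, only the matching partition directory is read. The partition layout can be confirmed with standard Spark SQL commands:

SHOW PARTITIONS test_p;
DESCRIBE FORMATTED test_p;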
The query plan output for the csv (textfile) table is:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
+- 'UnresolvedRelation `test_p3`
== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
+- SubqueryAlias test_p3
+- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9]
== Optimized Logical Plan ==
Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=1134.0 B, rowCount=3, hints=none)
+- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], Statistics(sizeInBytes=9.6 KB, rowCount=128, hints=none)
== Physical Plan ==
HiveTableScan [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], [isnotnull(country#9), (country#9 = Korea)]
Why is there a difference? Spark version: 2.2.1
Answer 0 (score: 0)
Logically they are not treated differently: the parsed, analyzed, and optimized logical plans are essentially the same for both tables.
But their internal formats are different. Parquet is a columnar format, so Spark can take a different approach in the physical plan, e.g. pruning in Parquet (reading only the needed columns and skipping data that cannot match). With the default spark.sql.hive.convertMetastoreParquet=true, Spark also replaces the Hive parquet relation with its own native data source, which is why the parquet plan shows a FileScan while the textfile table stays a HiveTableScan through the Hive SerDe.
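One way to see this beyond partition pruning (a sketch; Age = '30' is an arbitrary value, and the exact PushedFilters output depends on the Spark version) is to filter on a non-partition column and compare the physical plans:

EXPLAIN SELECT * FROM test_p  WHERE Age = '30';   -- FileScan parquet ... PushedFilters: [IsNotNull(Age), EqualTo(Age,30)]
EXPLAIN SELECT * FROM test_p3 WHERE Age = '30';   -- Filter over HiveTableScan; the predicate runs row by row

For the parquet table the predicate is handed to the columnar reader, which can skip row groups whose column statistics rule out a match; the textfile table must be deserialized line by line through the Hive SerDe before the filter is applied.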