Spark SQL reads parquet tables and csv tables differently

Date: 2018-08-21 12:05:30

Tags: apache-spark apache-spark-sql cost-based-optimizer

I created two external tables in spark-sql: one with the parquet file format, the other with the textfile file format.
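For context, here is a minimal sketch of how the two tables might have been declared. The column list and the Country partition column are reconstructed from the plans below; the exact DDL and the test_p3 path are assumptions:

// Sketch only: schema reconstructed from the query plans below;
// the test_p3 LOCATION is a hypothetical path.
spark.sql("""
  CREATE EXTERNAL TABLE test_p (
    Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
    CustomerSuffix STRING, Location STRING, Mobile STRING,
    Occupation STRING, Salary STRING)
  PARTITIONED BY (Country STRING)
  STORED AS PARQUET
  LOCATION 'file:/C:/dev/tests2'""")

spark.sql("""
  CREATE EXTERNAL TABLE test_p3 (
    Address STRING, Age STRING, CustomerID STRING, CustomerName STRING,
    CustomerSuffix STRING, Location STRING, Mobile STRING,
    Occupation STRING, Salary STRING)
  PARTITIONED BY (Country STRING)
  STORED AS TEXTFILE
  LOCATION 'file:/C:/dev/tests3'""") // hypothetical path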

When we pull the query plans for these two tables, Spark treats them differently.
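Both plans can be produced along these lines (a sketch; the exact statement from the original session is not shown):

// explain(true) prints the parsed, analyzed, optimized, and physical plans.
spark.sql("SELECT * FROM test_p WHERE country = 'Korea'").explain(true)
spark.sql("SELECT * FROM test_p3 WHERE country = 'Korea'").explain(true)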

The query plan output for the parquet table is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...

The query plan output for the csv table is:

== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p3`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p3
      +- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9]

== Optimized Logical Plan ==
Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=1134.0 B, rowCount=3, hints=none)
+- HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], Statistics(sizeInBytes=9.6 KB, rowCount=128, hints=none)

== Physical Plan ==
HiveTableScan [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], HiveTableRelation `default`.`test_p3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8], [Country#9], [isnotnull(country#9), (country#9 = Korea)]

Why is there a difference? Spark version: 2.2.1

1 Answer:

Answer 0 (score: 0)

Logically, they are not treated differently.

But their internal formats are different: parquet is columnar and optimized accordingly, so different access strategies can be applied, e.g. pruning in parquet.
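The effect is easy to see with a narrower projection (a hypothetical query; column names come from the plans above):

// Parquet: the FileScan's ReadSchema shrinks to the requested column,
// so only that column's data needs to be read from disk.
spark.sql("SELECT CustomerID FROM test_p WHERE country = 'Korea'").explain()

// Textfile: HiveTableScan still reads and parses every line of the
// file; unused columns can only be dropped after the row is scanned.
spark.sql("SELECT CustomerID FROM test_p3 WHERE country = 'Korea'").explain()

Note also that the parquet physical plan in the question already shows partition pruning at planning time: PrunedInMemoryFileIndex with PartitionCount: 1 means only the Country=Korea directory is listed.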