I have the following JSON document representing employees and their addresses:
[
{"id" : 1000, "name" : "dev", "age" : 30,
"address" : {"city":"noida","state":"UP","pincode":"201201"}},
{"id" : 1001, "name" : "ravi", "age" : 36,
"address" : {"city":"noida","state":"UP","pincode":"201501"}},
{"id" : 1002, "name" : "atul", "age" : 29,
"address" : {"city":"indore","state":"MP","pincode":"485201"}}
]
I am reading the JSON file with Spark SQL and applying a filter (predicate) on the "age" column to show only employees older than 29.
val spark = SparkSession.builder()
.appName("JsonRead")
.master("local[*]")
.getOrCreate()
val emp_df = spark.read
  .option("multiline", true) // the JSON file contains multiline records
.json(getClass.getResource("/sparksql/employee.json").getPath)
emp_df.printSchema()
/*root
* |-- address: struct (nullable = true)
* | |-- city: string (nullable = true)
* | |-- pincode: string (nullable = true)
* | |-- state: string (nullable = true)
* |-- age: long (nullable = true)
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
*/
//emp_df.show()
import spark.implicits._
val emp_ds = emp_df.as[Employee] // Using encoders to convert dataframe to dataset
emp_ds.filter(_.age > 29).explain(true)
case class Employee(id: Long, name: String, age: Long, address: Address)
case class Address(city: String, state: String, pincode: String)
Looking at the physical plan, I don't see any filter in PushedFilters: []
== Physical Plan ==
*(1) Filter <function1>.apply
+- *(1) FileScan json [address#0,age#1L,id#2L,name#3] Batched: false,
Format: JSON, Location:
InMemoryFileIndex
[file:/C:/Users/Dell/mygithub/techBlog/sparkexamples/target/cl
asses/sparksql/emp...,
PartitionFilters: [],
PushedFilters: [],
ReadSchema:
struct<address:struct<city:string,pincode:string,state:string>,age:bigint,
id:bigint,name:string>
Can someone tell me why the predicate (age > 29) is not being pushed down? Ideally it should be pushed down as part of the Spark Catalyst optimizer.
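My current understanding is that Catalyst can only push down predicates it can inspect as expression trees, whereas a typed filter like `_.age > 29` is compiled JVM bytecode it cannot look inside (the plan just shows `<function1>.apply`). A toy sketch of that distinction, with a made-up `Expr` ADT and `toSourceFilter` standing in for Catalyst expressions (this is NOT Spark's actual internals):

```scala
// Toy illustration: an expression tree can be rewritten into a
// data-source filter, but a plain Scala closure is opaque.
case class Employee(id: Long, name: String, age: Long)

// A typed filter is just a Function1 whose body is JVM bytecode;
// there is nothing an expression-tree optimizer can pattern-match on.
val typedPred: Employee => Boolean = _.age > 29

// Hypothetical mini expression ADT (stand-in for Catalyst expressions).
sealed trait Expr
case class Attr(name: String)   extends Expr
case class Lit(value: Long)     extends Expr
case class Gt(l: Expr, r: Expr) extends Expr

// A predicate in tree form can be inspected and converted into a
// source-level filter description; a closure cannot.
def toSourceFilter(e: Expr): Option[String] = e match {
  case Gt(Attr(n), Lit(v)) => Some(s"GreaterThan($n,$v)")
  case _                   => None
}

println(toSourceFilter(Gt(Attr("age"), Lit(29)))) // Some(GreaterThan(age,29))
println(typedPred(Employee(1000, "dev", 30)))     // true, but not inspectable
```

So my guess is that the lambda is evaluated row by row after the scan rather than being translated into a pushed filter; is that the actual reason, and is there a way to keep the typed API and still get pushdown?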