I have the following JSON document representing employees and their addresses:
[
{"id" : 1000, "name" : "dev", "age" : 30,
"address" : {"city":"noida","state":"UP","pincode":"201201"}},
{"id" : 1001, "name" : "ravi", "age" : 36,
"address" : {"city":"noida","state":"UP","pincode":"201501"}},
{"id" : 1002, "name" : "atul", "age" : 29,
"address" : {"city":"indore","state":"MP","pincode":"485201"}}
]
I am reading the JSON file with Spark SQL and applying a filter (predicate) on the "age" column to show only employees older than 29.
val spark = SparkSession.builder()
.appName("JsonRead")
.master("local[*]")
.getOrCreate()
val emp_df = spark.read
  .option("multiline", true) // the JSON file contains multiline records
.json(getClass.getResource("/sparksql/employee.json").getPath)
emp_df.printSchema()
/*root
* |-- address: struct (nullable = true)
* | |-- city: string (nullable = true)
* | |-- pincode: string (nullable = true)
* | |-- state: string (nullable = true)
* |-- age: long (nullable = true)
* |-- id: long (nullable = true)
* |-- name: string (nullable = true)
*/
//emp_df.show()
import spark.implicits._
val emp_ds = emp_df.as[Employee] // Using encoders to convert dataframe to dataset
emp_ds.filter(_.age > 29).explain(true)
case class Employee(id: Long, name: String, age: Long, address: Address)
case class Address(city: String, state: String, pincode: String)
Looking at the physical plan, I don't see any filter in PushedFilters: []
== Physical Plan ==
*(1) Filter <function1>.apply
+- *(1) FileScan json [address#0,age#1L,id#2L,name#3] Batched: false,
Format: JSON, Location:
InMemoryFileIndex
[file:/C:/Users/Dell/mygithub/techBlog/sparkexamples/target/cl
asses/sparksql/emp...,
PartitionFilters: [],
PushedFilters: [],
ReadSchema:
struct<address:struct<city:string,pincode:string,state:string>,age:bigint,
id:bigint,name:string>
Can someone tell me why the predicate (age > 29) is not being pushed down? Ideally it should be pushed down as part of the Spark Catalyst optimizer.
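My current understanding is that Catalyst can only push down predicates it can inspect as expression trees, whereas a typed filter like `_.age > 29` is compiled JVM bytecode it cannot look inside (the plan just shows `<function1>.apply`). A toy sketch of that distinction, with a made-up `Expr` ADT and `toSourceFilter` standing in for Catalyst expressions (this is NOT Spark's actual internals):

```scala
// Toy illustration: an expression tree can be rewritten into a
// data-source filter, but a plain Scala closure is opaque.
case class Employee(id: Long, name: String, age: Long)

// A typed filter is just a Function1 whose body is JVM bytecode;
// there is nothing an expression-tree optimizer can pattern-match on.
val typedPred: Employee => Boolean = _.age > 29

// Hypothetical mini expression ADT (stand-in for Catalyst expressions).
sealed trait Expr
case class Attr(name: String)   extends Expr
case class Lit(value: Long)     extends Expr
case class Gt(l: Expr, r: Expr) extends Expr

// A predicate in tree form can be inspected and converted into a
// source-level filter description; a closure cannot.
def toSourceFilter(e: Expr): Option[String] = e match {
  case Gt(Attr(n), Lit(v)) => Some(s"GreaterThan($n,$v)")
  case _                   => None
}

println(toSourceFilter(Gt(Attr("age"), Lit(29)))) // Some(GreaterThan(age,29))
println(typedPred(Employee(1000, "dev", 30)))     // true, but not inspectable
```

So my guess is that the lambda is evaluated row by row after the scan rather than being translated into a pushed filter; is that the actual reason, and is there a way to keep the typed API and still get pushdown?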