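In Spark SQL, we can use either strings or Column objects when running a query. However, I have noticed that we sometimes get unexpected results depending on how the query is written.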
Context:
I have this JSON data stored on a single line in a file, and I want to access the statuses.text field.
{"statuses":[ {"created_at": "Thu Nov 21 12:00:00 +0000 2015", "id": 1197665997374836737,"id_str": "id-str-sample","text": "This is a sample text","truncated": false}],"search_metadata":{"completed_in":0.078,"max_id":15201,"max_id_str":"5213","next_results":"sample","query":"A sample query","refresh_url":"sample","count":0, "since_id":0,"since_id_str":"0"}}
// Load the data using SparkSession object
val jsonData = spark.read.json("/file/to/json")
I can access the text data in either of the following ways:
jsonData.select("statuses.text") or jsonData.select($"statuses.text")
res16: org.apache.spark.sql.DataFrame = [text: array<string>]
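To reproduce the working forms outside the shell, something like the following sketch may help. It assumes a local SparkSession and a shortened, made-up copy of the record above written to a temporary file; the object name SelectForms and the load helper are hypothetical. The col function from org.apache.spark.sql.functions is included only as a third, equivalent way to build the same Column. Note that the $"..." syntax needs import spark.implicits._ in a standalone program (the shell imports it automatically).

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SelectForms {
  // Shortened, hypothetical version of the sample record (one JSON object on one line).
  val sample: String =
    """{"statuses":[{"id_str":"id-str-sample","text":"This is a sample text","truncated":false}]}"""

  // Writes the sample record to a temp file and loads it as a DataFrame.
  def load(spark: SparkSession): DataFrame = {
    val path = Files.createTempFile("statuses", ".json")
    Files.write(path, sample.getBytes("UTF-8"))
    spark.read.json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-forms")
      .getOrCreate()
    import spark.implicits._ // brings the $"..." interpolator into scope

    val jsonData = load(spark)

    // Three ways to name the same nested field.
    val byString       = jsonData.select("statuses.text")
    val byInterpolator = jsonData.select($"statuses.text")
    val byCol          = jsonData.select(col("statuses.text"))

    byString.printSchema()
    byInterpolator.printSchema()
    byCol.printSchema()

    spark.stop()
  }
}
```

In all three calls the full dotted path reaches Spark as a single name, so it is Spark that resolves the nested field rather than the Scala compiler.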
Now things get weird.
These queries give me errors:
jsonData.select('statuses.text)
res14: org.apache.spark.sql.DataFrame = [statuses: array<struct<created_at:string,id:bigint,id_str:string,text:string,truncated:boolean>>]
<console>:26: error: value text is not a member of Symbol
rootTweets.select('statuses.text)
// Yet similar to the previous
jsonData.select('statuses).select('text)
org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [statuses];;
'Project ['text]
+- Project [statuses#7]
+- Relation[search_metadata#6,statuses#7] json
/*
Same errors with jsonData.select("statuses").select("text")
*/
Why is there such a difference between ', "" and $""?
Remarks and possible hints:
The query
jsonData.select('statuses).select('statuses)...chained x times...select('statuses)
returns the same value every time:
res33: org.apache.spark.sql.DataFrame = [statuses: array<struct<created_at:string,id:bigint,id_str:string,text:string,truncated:boolean>>]
The inferred schema is:
jsonData.printSchema
root
|-- search_metadata: struct (nullable = true)
| |-- count: long (nullable = true)
... I cut the output ...
|-- statuses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- text: string (nullable = true)
... I cut the output ...
// Select the statuses column only
val tweets = jsonData.select('statuses)
tweets.printSchema
root
|-- statuses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_at: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- id_str: string (nullable = true)
| | |-- text: string (nullable = true)
| | |-- truncated: boolean (nullable = true)
/*
I get the same results with jsonData.select("statuses").printSchema and with jsonData.select($"statuses").printSchema
*/