In Spark SQL queries, what is the difference between single quotes `'`, double quotes `""`, and the Scala string interpolator `$""`?

Time: 2019-11-22 20:39:50

Tags: apache-spark apache-spark-sql

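In Spark SQL, we can use either strings or Column objects when running queries. However, I have noticed that, depending on how the query is written, we sometimes get unexpected results.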

Context: I have this JSON data stored on a single line in a file, and I want to access the statuses.text field.

{"statuses":[ {"created_at": "Thu Nov 21 12:00:00 +0000 2015", "id": 1197665997374836737,"id_str": "id-str-sample","text": "This is a sample text","truncated": false}],"search_metadata":{"completed_in":0.078,"max_id":15201,"max_id_str":"5213","next_results":"sample","query":"A sample query","refresh_url":"sample","count":0, "since_id":0,"since_id_str":"0"}}
// Load the data using SparkSession object
val jsonData = spark.read.json("/file/to/json")

I can access the text data in either of the following ways:

jsonData.select("statuses.text") // or: jsonData.select($"statuses.text")
res16: org.apache.spark.sql.DataFrame = [text: array<string>]
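
A minimal sketch of why both forms can work, under some assumptions: `select` has one overload that takes column names as strings and another that takes `Column` objects, and `$"..."` is an interpolator brought in by `import spark.implicits._` that builds a `Column`. The object name, the local `SparkSession` settings, and the shortened `sampleJson` are illustrative and not part of the original setup.

```scala
import org.apache.spark.sql.{Column, SparkSession}

object SelectFormsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-forms-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened, illustrative copy of the sample data from the question.
    val sampleJson =
      """{"statuses":[{"id_str":"id-str-sample","text":"This is a sample text"}]}"""
    val jsonData = spark.read.json(Seq(sampleJson).toDS())

    // String overload: Spark parses the name, so the dot reaches into the struct.
    val byName = jsonData.select("statuses.text")

    // Column overload: $"..." comes from spark.implicits._ and yields a Column.
    val nested: Column = $"statuses.text"
    val byColumn = jsonData.select(nested)

    byName.printSchema()
    byColumn.printSchema()

    // Both selections expose the same output column names.
    assert(byName.columns.sameElements(byColumn.columns))
    spark.stop()
  }
}
```

In both cases the dot is part of a string that Spark itself parses, which is why the nested field can be reached.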

Now things get weird.

These queries give me errors:

jsonData.select('statuses.text)

res14: org.apache.spark.sql.DataFrame = [statuses: array<struct<created_at:string,id:bigint,id_str:string,text:string,truncated:boolean>>]
<console>:26: error: value text is not a member of Symbol
       rootTweets.select('statuses.text)
// Yet similar to the previous
jsonData.select('statuses).select('text)

org.apache.spark.sql.AnalysisException: cannot resolve '`text`' given input columns: [statuses];;
'Project ['text]
+- Project [statuses#7]
   +- Relation[search_metadata#6,statuses#7] json
/*
Same errors with jsonData.select("statuses").select("text")
*/
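
One plausible reading of the first error is that it comes from the Scala compiler, not from Spark: `'statuses` is a `scala.Symbol` literal, so `'statuses.text` is parsed as a member access `.text` on a `Symbol`, and no such member exists. The plain-Scala sketch below illustrates this without Spark; it uses `Symbol("...")` in place of the `'name` literal because the literal syntax is deprecated in Scala 2.13 and removed in Scala 3. The object name is illustrative.

```scala
object SymbolParsingSketch {
  def main(args: Array[String]): Unit = {
    // Symbol("statuses") is the non-literal spelling of 'statuses.
    val statuses: Symbol = Symbol("statuses")

    // statuses.text // does not compile: value text is not a member of Symbol

    // To keep the dot inside the name, the whole path has to be a single symbol.
    val nested: Symbol = Symbol("statuses.text")

    println(statuses.name)
    println(nested.name)

    assert(statuses.name == "statuses")
    assert(nested.name == "statuses.text")
  }
}
```

With `import spark.implicits._` in scope, `jsonData.select(Symbol("statuses.text"))` should then address the nested field, just as `$"statuses.text"` does. The second error looks different in kind: after `select('statuses)` the DataFrame has a single top-level column, `statuses`, and `text` exists only as a field nested inside it, so it would still have to be addressed as `statuses.text`.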

Why is there such a difference between `'`, `""`, and `$""`?

Notes and possible hints: the query

jsonData.select('statuses).select('statuses)...chained x times...select('statuses)

returns the same value.

res33: org.apache.spark.sql.DataFrame = [statuses: array<struct<created_at:string,id:bigint,id_str:string,text:string,truncated:boolean>>]

The inferred schema is:

jsonData.printSchema

root
 |-- search_metadata: struct (nullable = true)
 |    |-- count: long (nullable = true)
... I cut the output ...
 |-- statuses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- text: string (nullable = true)
... I cut the output ...
// Select the statuses column only
val tweets = jsonData.select('statuses)
tweets.printSchema

root
 |-- statuses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created_at: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- text: string (nullable = true)
 |    |    |-- truncated: boolean (nullable = true)

 /*
 I get the same results with jsonData.select("statuses").printSchema and with jsonData.select($"statuses").printSchema
*/
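
Given this schema, `text` is a field of each element of the `statuses` array, not a top-level column. The sketch below shows two ways that might be used to reach it after narrowing to `statuses`: addressing it by its full path, or flattening the array with `explode` first. It assumes a local `SparkSession` and a shortened copy of the sample data; the object name, `sampleJson`, and the alias `status` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object NestedTextSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-text-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened, illustrative copy of the sample data from the question.
    val sampleJson =
      """{"statuses":[{"id_str":"id-str-sample","text":"This is a sample text"}]}"""
    val jsonData = spark.read.json(Seq(sampleJson).toDS())

    // Option 1: keep the array and address the nested field by its full path.
    val viaPath = jsonData.select(col("statuses")).select(col("statuses.text"))
    viaPath.printSchema()

    // Option 2: explode produces one row per array element ("status" is a made-up alias),
    // after which text is a field of a struct column instead of an array column.
    val perStatus = jsonData.select(explode(col("statuses")).as("status"))
    val texts = perStatus.select(col("status.text"))
    texts.printSchema()
    texts.show(false)

    spark.stop()
  }
}
```

The first option keeps one row per input record with an array of texts; the second yields one row per status, which is usually easier to filter or join on.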

0 answers:

No answers