Can Spark SQL not count correctly, or can I not write SQL correctly?

Time: 2018-03-13 20:27:21

Tags: apache-spark pyspark apache-spark-sql pyspark-sql databricks

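In a Python notebook on Databricks "Community Edition", I am experimenting with the City of San Francisco's open data on 911 emergency calls requesting firefighters. (It is an old 2016 copy of the data, used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)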

After mounting the data and reading it into a DataFrame, fire_service_calls_df, with an explicitly defined schema, I registered that DataFrame as a SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

Using that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34

...or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

...or in a SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls
  

Databricks table output with column "count(DISTINCT CallType)" and single value "33".

Why do I get two different counts? (34 appears to be the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

1 answer:

Answer 0 (score: 4)

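To answer the question

  

Can Spark SQL not count correctly, or can I not write SQL correctly?

in the title: I can't write SQL correctly.

Rule <insert number> of writing SQL: think about NULL and UNDEFINED. count(DISTINCT CallType) skips rows where CallType is NULL, whereas SELECT DISTINCT keeps NULL as a value of its own, so counting the rows of the distinct subquery yields one more: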

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)
  

34
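
The same gap can be reproduced on a tiny table. The following is a minimal sketch, not the original notebook code: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a Spark cluster, on the assumption that the NULL handling of count(DISTINCT ...) shown in the question is the standard SQL behavior. The six rows are made up for illustration.

```python
import sqlite3

# Hypothetical miniature stand-in for the fireServiceCalls table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Alarms",), ("Vehicle Fire",), (None,), (None,), ("Other",)],
)

# count(DISTINCT col) skips NULLs, so only the real names are counted.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# SELECT DISTINCT keeps NULL as one row, and count(*) counts rows, not values.
count_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

# Counting the NULL rows directly explains the gap between the two numbers.
null_rows = conn.execute(
    "SELECT count(*) FROM fireServiceCalls WHERE CallType IS NULL"
).fetchone()[0]

print(count_distinct, count_rows, null_rows)
```

However many rows are NULL, they collapse into a single distinct row, so the two counts differ by at most one. An explicit IS NULL count is a quick way to confirm that NULLs are the cause.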

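Also, I apparently can't read.

pault suggested in a comment:

  

With only 30 values, you could sort and print all the distinct items to see where the difference is.

Well, I had actually thought of that myself. (Minus the sorting.) Except there was no difference: the output always contained 34 call types, whether I generated it with SQL or with a DataFrame query. I simply failed to notice that one of them was named null: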

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
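
In the table above, the missing value prints as the word null and blends in with the real names. One way to make it stand out is to sort the distinct values and print them with repr(). This is only a sketch on a plain Python list: the list and the helper name sort_with_nulls_last are hypothetical stand-ins for values collected from the distinct query, with None playing the role of NULL.

```python
# Hypothetical stand-in for the values collected from a distinct query;
# None plays the role of a SQL NULL.
call_types = ["Marine Fire", "Alarms", None, "Other", "Explosion"]


def sort_with_nulls_last(values):
    """Sort strings alphabetically and place None entries at the end."""
    # Python 3 cannot compare None with str, so sort on an (is_null, text) key.
    return sorted(values, key=lambda v: (v is None, v or ""))


ordered = sort_with_nulls_last(call_types)

# repr() prints None without quotes, so it cannot be mistaken for a string.
for value in ordered:
    print(repr(value))

# Number of missing entries among the distinct values.
null_count = sum(value is None for value in ordered)
```

Sorting on a key is needed because a plain sorted() call raises a TypeError when the list mixes None and strings.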