Question

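In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco's open data about emergency calls to 911 requesting firefighters. (This is the old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube), made available on S3 for that tutorial.)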

After mounting the data and reading it into a DataFrame fire_service_calls_df with an explicitly defined schema, I registered that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occur:

fire_service_calls_df.select('CallType').distinct().count()

Out[n]: 34

...or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()

+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

...or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Why do I get two different counts? (It seems that 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

Answer 1

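To answer the question in the title,

Can't Spark SQL count correctly, or can't I write SQL correctly?

the short answer is: I can't write SQL correctly.

Rule <insert number> of writing SQL: think about NULL and UNDEFINED. count(DISTINCT CallType) skips NULL values, whereas counting the rows of a SELECT DISTINCT subquery counts the NULL group as one row: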

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34
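The difference between the two query shapes can be reproduced on a tiny table. The sketch below uses Python's built-in sqlite3 module as a stand-in for Spark SQL, purely so that it runs without a cluster; the sample rows are made up, and the assumption is that Spark SQL follows the same standard NULL handling for count, which is consistent with the 33 versus 34 seen above.

```python
import sqlite3

# Hypothetical miniature of the fireServiceCalls table: two real call types
# plus two rows where CallType is missing (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Alarms",), ("HazMat",), (None,), (None,)],
)

# count(DISTINCT col) only looks at non-NULL values.
distinct_non_null = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# SELECT DISTINCT keeps a single NULL row, and count(*) counts rows.
distinct_rows = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

print(distinct_non_null, distinct_rows)  # prints: 2 3
```

The off-by-one between the two results is exactly the single NULL group.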

Also, I apparently can't read:

pault suggested in a comment

With only 30 values, you can sort and print all the distinct items to see where the difference is.

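Well, I had actually thought of that myself. (Minus the sorting.) Except that there was no difference: the output always contained 34 call types, whether I generated it with an SQL query or a DataFrame query. I simply didn't notice that one of the names was null: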

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
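A NULL hiding in a long listing is easier to catch when it is counted explicitly instead of being read off the output. The following is a minimal sketch of that check, again using Python's built-in sqlite3 module as a stand-in for Spark SQL and a made-up handful of rows; only the table and column names are taken from the question.

```python
import sqlite3

# Hypothetical sample rows; None becomes SQL NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Other",), (None,), ("Alarms",), (None,), ("Alarms",)],
)

# count(*) counts every row, count(CallType) only the non-NULL ones,
# so the difference is the number of rows with a missing call type.
null_rows = conn.execute(
    "SELECT count(*) - count(CallType) FROM fireServiceCalls"
).fetchone()[0]

# Sorting the distinct values groups the NULL at one end of the list
# (SQLite puts NULL first in ascending order).
distinct_types = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT CallType FROM fireServiceCalls ORDER BY CallType"
    )
]

print(null_rows)       # prints: 2
print(distinct_types)  # prints: [None, 'Alarms', 'Other']
```

A non-zero null_rows is the cue that count(DISTINCT ...) and a row count over SELECT DISTINCT will disagree by one.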
