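In a Python notebook on Databricks "Community Edition", I am experimenting with the City of San Francisco's open data about 911 emergency calls requesting firefighters. (It is an old 2016 copy of the data, used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)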
After mounting the data and reading it into a DataFrame fire_service_calls_df with an explicitly defined schema, I aliased that DataFrame as an SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
...or with SQL from Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+
...or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different counts? (It seems that 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)
Answer 0 (score: 4)
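To answer the question from the title, "Can Spark SQL not count correctly, or can I not write SQL correctly?": I cannot write SQL correctly.
Rule <insert number> of writing SQL: think about NULL and UNDEFINED. count(DISTINCT CallType) skips NULL values, whereas distinct().count() counts rows, and the NULL row is one of them. Counting the rows of a DISTINCT subquery gives the same result as the DataFrame API: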
%sql
SELECT count(*)
FROM (
SELECT DISTINCT CallType
FROM fireServiceCalls
)
34
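The gap between the two counts can be reproduced outside Spark. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of Spark SQL, on a toy table whose rows are made up for illustration (only the table and column names mirror the ones above). It relies on the standard SQL behaviour that count(DISTINCT col) ignores NULL, while SELECT DISTINCT keeps NULL as a row of its own.

```python
import sqlite3

# Toy stand-in for the real table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fireServiceCalls (CallType TEXT)")
conn.executemany(
    "INSERT INTO fireServiceCalls VALUES (?)",
    [("Alarms",), ("Alarms",), ("Medical Incident",),
     (None,), (None,), ("Structure Fire",)],
)

# count(DISTINCT ...) skips NULL entirely.
count_distinct = conn.execute(
    "SELECT count(DISTINCT CallType) FROM fireServiceCalls"
).fetchone()[0]

# Counting the rows of a DISTINCT subquery includes the NULL row.
rows_of_distinct = conn.execute(
    "SELECT count(*) FROM (SELECT DISTINCT CallType FROM fireServiceCalls)"
).fetchone()[0]

# An explicit check for how many rows have no call type at all.
null_rows = conn.execute(
    "SELECT count(*) FROM fireServiceCalls WHERE CallType IS NULL"
).fetchone()[0]

print(count_distinct, rows_of_distinct, null_rows)
```

The first two numbers differ by exactly one whenever the column contains at least one NULL, which matches the 33 versus 34 seen in the notebook.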
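Besides, I apparently cannot read either:
pault suggested in a comment:
With only 30 values, you could sort and print all the distinct items to see where the difference is.
Well, I had actually thought of that myself. (Minus the sorting.) Except there was no difference: the output always contained 34 call types, whether I generated it with an SQL query or a DataFrame query. I simply did not notice that one of them was named null: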
+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
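A null is easy to overlook in a listing like this, so it may be safer to check for it explicitly. Below is a small sketch in plain Python; summarize_distinct is a hypothetical helper, and the sample list is made up. In the notebook, the input could presumably be built with something like [row.CallType for row in fire_service_calls_df.select('CallType').distinct().collect()], which is an assumption and is not run here.

```python
def summarize_distinct(values):
    """Summarize already-distinct column values; None stands for SQL NULL."""
    non_null = sorted(v for v in values if v is not None)
    has_null = any(v is None for v in values)
    return {
        # What distinct().count() would report: the NULL row is included.
        "rows": len(non_null) + (1 if has_null else 0),
        # What count(DISTINCT col) would report: NULL is skipped.
        "non_null": len(non_null),
        "has_null": has_null,
        # NULL is listed first so that it cannot hide in the middle.
        "sorted": ([None] if has_null else []) + non_null,
    }


# Made-up sample; the real list has 34 entries including the null.
sample = ["Marine Fire", None, "Alarms", "HazMat"]
summary = summarize_distinct(sample)
print(summary)
```

Putting the null at the top of the sorted output, together with the two counts side by side, makes the off-by-one visible at a glance.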