Question

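I am working in the PySpark shell with data from Hive. The goal is to collect counters for multiple items. Below are a sample DataFrame and the query behind it. The resource I used, "Is it possible to specify condition in Count()?", works, but only for a limited number of counters.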

DriveHealth = sqlContext.sql("Select Health From testdrivestatus")

Health   |
----------
Healthy  
Healthy
Error A
Error B
Error C

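The goal is to create counters for:

The number of drives whose health status is Healthy
The number of drives whose health status is unhealthy, i.e. Error A, Error B, and Error C combined
The number of drives with each individual health status, i.e. separate counters for Healthy, Error A, Error B, and Error C

In this case, we would have something like: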

Health Counter
--------------
Healthy: 2
Unhealthy: 3
Error A: 1
Error B: 1
Error C: 1

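What I have tried works for a small number of cases, but I have more than 60 distinct health statuses, and I would like to know whether there is a better way to do this: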

DriveHealth = sqlContext.sql("""Select
Count( case when Health = 'Healthy' then 1 else null end) as Healthy,
Count( case when Health <> 'Healthy' then 1 else null end) as UnHealthy,
Count( case when Health = 'Error A' then 1 else null end) as ErrorA,
... (skipping typing through Error C)
From testdrivestatus""")

Answer 1

What you want to do is group by your Health column:

select count(*) as total, Health from testdrivestatus group by Health
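To illustrate how the grouped result can feed every counter the question asks for, here is a minimal sketch. It assumes an in-memory SQLite table as a stand-in for the Hive table `testdrivestatus`, so that it runs without a Spark session; the `ORDER BY` and the names `per_status` and `counters` are illustrative additions, not part of the original answer. In the PySpark shell the same GROUP BY statement would presumably be passed to `sqlContext.sql(...)` instead.

```python
import sqlite3

# Assumption: an in-memory SQLite table stands in for the Hive table testdrivestatus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testdrivestatus (Health TEXT)")
conn.executemany(
    "INSERT INTO testdrivestatus (Health) VALUES (?)",
    [("Healthy",), ("Healthy",), ("Error A",), ("Error B",), ("Error C",)],
)

# One row per distinct status, no matter how many statuses exist.
# ORDER BY is added only to make the row order deterministic.
rows = conn.execute(
    "SELECT Health, COUNT(*) AS total "
    "FROM testdrivestatus GROUP BY Health ORDER BY Health"
).fetchall()
per_status = dict(rows)

# Roll the per-status counts up into the two summary counters.
healthy = per_status.get("Healthy", 0)
unhealthy = sum(n for status, n in per_status.items() if status != "Healthy")

# Summary counters first, then one counter per individual error status.
counters = {"Healthy": healthy, "Unhealthy": unhealthy}
counters.update({s: n for s, n in per_status.items() if s != "Healthy"})

for name, n in counters.items():
    print(f"{name}: {n}")
```

The point of this design is that the query stays the same size whether there are 4 statuses or 60: the Unhealthy total is derived from the grouped rows rather than from one hand-written `case` expression per status.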
