How to count occurrences of a specific text string, grouped by other columns

Date: 2019-07-10 15:50:39

Tags: sql apache-spark

I have a table population_table with the columns user_id, provider_name, and city. I want to count, for each provider, how many users appear in each city. So, for example, I want the output to look like this:

provider_name |  Users |  Atlanta | Chicago | New York
______________________________________________________
Alpha         |    100 |       50 |      25 |       25
Beta          |    200 |      100 |      75 |       25
Kappa         |    500 |      300 |     100 |      100

I tried using:

select provider_name, count (distinct user_id) AS Users, count(city) AS City 
from population_table
group by provider_name

How do I write this query to get a per-city breakdown of users for each provider?

2 Answers:

Answer 0: (score: 1)

I think you want conditional aggregation. From your description it is not clear whether count(distinct) is necessary, so I would try this first:

select provider_name, count(*) AS Users,
       sum(case when city = 'Atlanta' then 1 else 0 end) as Atlanta,
       sum(case when city = 'Chicago' then 1 else 0 end) as Chicago,
       sum(case when city = 'New York' then 1 else 0 end) as New_York
from population_table
group by provider_name;
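As a quick sanity check, here is the conditional-aggregation query run against a tiny in-memory SQLite table (the sqlite3 setup and the sample rows are illustrative, not part of the question; the same SQL also runs in Spark SQL):

```python
import sqlite3

# Build a toy population_table; the data values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population_table (user_id TEXT, provider_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO population_table VALUES (?, ?, ?)",
    [
        ("u1", "Alpha", "Atlanta"),
        ("u2", "Alpha", "Atlanta"),
        ("u1", "Alpha", "Chicago"),   # u1 appears in two cities
        ("u3", "Beta", "New York"),
    ],
)

# One output row per provider; each SUM(CASE ...) acts as a per-city counter.
rows = conn.execute("""
    SELECT provider_name, COUNT(*) AS Users,
           SUM(CASE WHEN city = 'Atlanta'  THEN 1 ELSE 0 END) AS Atlanta,
           SUM(CASE WHEN city = 'Chicago'  THEN 1 ELSE 0 END) AS Chicago,
           SUM(CASE WHEN city = 'New York' THEN 1 ELSE 0 END) AS New_York
    FROM population_table
    GROUP BY provider_name
    ORDER BY provider_name
""").fetchall()
# rows -> [('Alpha', 3, 2, 1, 0), ('Beta', 1, 0, 0, 1)]
```

Note that COUNT(*) counts rows, so u1 is counted twice for Alpha here; that is exactly the ambiguity the answer raises.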

If you do need count(distinct):

select provider_name, count(distinct user_id) AS Users,
       count(distinct case when city = 'Atlanta' then user_id end) as Atlanta,
       count(distinct case when city = 'Chicago' then user_id end) as Chicago,
       count(distinct case when city = 'New York' then user_id end) as New_York
from population_table
group by provider_name;
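The count(distinct) variant can be checked the same way on a toy SQLite table (again, the sqlite3 setup and sample rows are illustrative). The deliberate duplicate row shows why distinct matters:

```python
import sqlite3

# Toy population_table with a duplicate (user_id, provider_name, city) row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population_table (user_id TEXT, provider_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO population_table VALUES (?, ?, ?)",
    [
        ("u1", "Alpha", "Atlanta"),
        ("u1", "Alpha", "Atlanta"),   # duplicate row: u1 logged twice in Atlanta
        ("u2", "Alpha", "Chicago"),
    ],
)

# CASE returns user_id for matching rows and NULL otherwise;
# COUNT(DISTINCT ...) ignores NULLs, so each column counts unique users per city.
rows = conn.execute("""
    SELECT provider_name, COUNT(DISTINCT user_id) AS Users,
           COUNT(DISTINCT CASE WHEN city = 'Atlanta'  THEN user_id END) AS Atlanta,
           COUNT(DISTINCT CASE WHEN city = 'Chicago'  THEN user_id END) AS Chicago,
           COUNT(DISTINCT CASE WHEN city = 'New York' THEN user_id END) AS New_York
    FROM population_table
    GROUP BY provider_name
""").fetchall()
# rows -> [('Alpha', 2, 1, 1, 0)]; the plain COUNT(*) version would report 3 users.
```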

Answer 1: (score: 0)

If your set of cities is not fixed, I don't know how to produce the column list in Spark SQL. But with pyspark you can build the output DataFrame table from input like this:

from pyspark.sql import functions as F

# One row per (provider_name, city) with its row count; cached because it is reused twice.
counts = input.groupBy('provider_name', 'city').count().cache()

# Total rows per provider: sum the per-city counts.
# (A plain .count() here would count cities per provider, not users.)
countsPerProvider = counts.groupBy('provider_name').agg(F.sum('count').alias('users'))

# Pivot turns each distinct city value into its own column.
pivoted = counts.groupBy('provider_name').pivot('city').sum('count')

# Joining on the column name keeps a single provider_name column in the result.
table = pivoted.join(countsPerProvider, 'provider_name')