How do I transpose vertical data to horizontal data with BigQuery?

Time: 2019-08-23 13:22:08

Tags: sql google-bigquery

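I am currently working with employee benefits data, but the spreadsheet data is a complete mess. I would like to reshape it into a format that is easy to read. The current format looks like this: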

Relationship  EmployeeName  BenefitCode  BenefitOption  Name
              Alice         DEN          EEC
CHL           Alice         DEN          EEC            John
SPS           Alice         MED                         Lee
              Lily          VIS
SPS           Lily          VIS                         Tom

I would like to transform it into this:

Relationship    Name     MED    DEN    VIS 
Employee        Alice           EEC
CHL             John            EEC
SPS             Lee      MED
Employee        Lily                   VIS
SPS             Tom                    VIS

I tried grouping the data by name and BenefitCode, but I got thoroughly confused.

My code is as follows:

SELECT   RelationshipCode, EmployeeName, 
         MAX(IF(BenefitCode = "DEN", BenefitOptionCode , NULL)) AS DEN,
         MAX(IF(BenefitCode = "MED", BenefitOptionCode , NULL)) AS MEDICAL,
         MAX(IF(BenefitCode = "VIS", BenefitOptionCode , NULL)) AS VISION
FROM `TableXXX` 
WHERE RelationshipCode = 'Employee'
GROUP BY EmployeeName, RelationshipCode

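However, this drops the dependents and their relationship to the employee, which does not seem like a good idea. Can anyone tell me how to convert the vertical data to horizontal data? Or do you have a better idea for solving this problem?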

2 answers:

Answer 0 (score: 2)

Below is for BigQuery Standard SQL

#standardSQL
SELECT 
  EmployeeName,
  IF(Relationship IS NULL, 'Self', Relationship) Relationship, 
  IFNULL(Name, EmployeeName) Name, 
  MAX(IF(BenefitCode = 'DEN', IFNULL(BenefitOption, BenefitCode), NULL)) AS DEN,
  MAX(IF(BenefitCode = 'MED', IFNULL(BenefitOption, BenefitCode), NULL)) AS MEDICAL,
  MAX(IF(BenefitCode = 'VIS', IFNULL(BenefitOption, BenefitCode), NULL)) AS VISION  
FROM `project.dataset.table`
GROUP BY Name, EmployeeName, Relationship 
-- ORDER BY Name, Relationship

If applied to the sample data from your question, the result is

Row EmployeeName    Relationship    Name    DEN     MEDICAL VISION   
1   Alice           Self            Alice   EEC     null    null     
2   Alice           CHL             John    EEC     null    null     
3   Alice           SPS             Lee     null    MED     null     
4   Lily            Self            Lily    null    null    VIS  
5   Lily            SPS             Tom     null    null    VIS    
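
To try this without access to the real table, the sample rows from the question can be inlined as a CTE. This is only a sketch under assumptions: the CTE names `benefits` and `normalized` and the intermediate column names are placeholders that do not appear in the question, and blank spreadsheet cells are assumed to load as NULL. The defaults are applied in a separate step so that the GROUP BY columns cannot be confused with the output aliases.

```sql
#standardSQL
-- Sample rows copied from the question; blank cells are assumed to be NULL.
WITH benefits AS (
  SELECT CAST(NULL AS STRING) AS Relationship, 'Alice' AS EmployeeName,
         'DEN' AS BenefitCode, 'EEC' AS BenefitOption, CAST(NULL AS STRING) AS Name UNION ALL
  SELECT 'CHL', 'Alice', 'DEN', 'EEC', 'John' UNION ALL
  SELECT 'SPS', 'Alice', 'MED', NULL, 'Lee' UNION ALL
  SELECT NULL, 'Lily', 'VIS', NULL, NULL UNION ALL
  SELECT 'SPS', 'Lily', 'VIS', NULL, 'Tom'
),
-- Apply the defaults once, under new (hypothetical) column names.
normalized AS (
  SELECT
    EmployeeName,
    IFNULL(Relationship, 'Self') AS RelationshipLabel,
    IFNULL(Name, EmployeeName) AS PersonName,
    BenefitCode,
    IFNULL(BenefitOption, BenefitCode) AS OptionValue
  FROM benefits
)
-- Pivot: one output column per benefit code.
SELECT
  EmployeeName,
  RelationshipLabel AS Relationship,
  PersonName AS Name,
  MAX(IF(BenefitCode = 'DEN', OptionValue, NULL)) AS DEN,
  MAX(IF(BenefitCode = 'MED', OptionValue, NULL)) AS MEDICAL,
  MAX(IF(BenefitCode = 'VIS', OptionValue, NULL)) AS VISION
FROM normalized
GROUP BY EmployeeName, RelationshipLabel, PersonName
ORDER BY EmployeeName, Name
```

If the blank cells arrive as empty strings rather than NULL, wrapping the columns in something like `NULLIF(TRIM(Relationship), '')` before `IFNULL` is one possible adjustment.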

Another option is to extend the version above into a "hierarchical" one

#standardSQL
SELECT EmployeeName,
  ARRAY_AGG(STRUCT(Name, Relationship, DEN, MEDICAL, VISION)) benefits
FROM (
  SELECT 
    EmployeeName,
    IF(Relationship IS NULL, 'Self', Relationship) Relationship, 
    IFNULL(Name, EmployeeName) Name, 
    MAX(IF(BenefitCode = 'DEN', IFNULL(BenefitOption, BenefitCode), NULL)) AS DEN,
    MAX(IF(BenefitCode = 'MED', IFNULL(BenefitOption, BenefitCode), NULL)) AS MEDICAL,
    MAX(IF(BenefitCode = 'VIS', IFNULL(BenefitOption, BenefitCode), NULL)) AS VISION  
  FROM `project.dataset.table`
  GROUP BY Name, EmployeeName, Relationship 
) 
GROUP BY EmployeeName
-- ORDER BY EmployeeName

In this case, the result will be

Row EmployeeName    benefits.Name   benefits.Relationship   benefits.DEN    benefits.MEDICAL    benefits.VISION  
1   Alice           Alice           Self                    EEC             null                null     
                    John            CHL                     EEC             null                null     
                    Lee             SPS                     null            MED                 null       
2   Lily            Lily            Self                    null            null                VIS  
                    Tom             SPS                     null            null                VIS  
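
A nested result like this can be flattened again with UNNEST when only some rows are needed. The sketch below assumes the hierarchical result has been saved to a table; the name `project.dataset.employee_benefits` is hypothetical and is not part of the answer.

```sql
#standardSQL
-- `project.dataset.employee_benefits` is a hypothetical table holding the nested result.
SELECT
  e.EmployeeName,
  b.Name,
  b.Relationship,
  b.DEN
FROM `project.dataset.employee_benefits` AS e,
  UNNEST(e.benefits) AS b   -- one output row per element of the benefits array
WHERE b.DEN IS NOT NULL     -- keep only people with dental coverage
```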

Answer 1 (score: 0)

I would probably organize this into CTEs, making each column (or concept) its own logical CTE.

with people as (
  select distinct EmployeeName as person from <dataset>.<table> union distinct
  select distinct Name as person from <dataset>.<table>
),
med as (
  -- select people with MED columns
),
den as (
  -- select people with DEN columns
),
... (etc)
joined as (
  select * from people
  left join med using(person)
  left join den using(person)
)
select * from joined

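My general advice for situations like this is to start with what you know (the way I started with MED and DEN). Once those simple pieces are done, move on to the ones that are more complex or require assumptions. Breaking them into CTE blocks helps encapsulate each idea.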
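
As an illustration only, the CTE outline could be filled in roughly as follows. This is an assumption about what the elided blocks might contain, not the answerer's actual code: it identifies each person as `IFNULL(Name, EmployeeName)`, falls back to the benefit code when the option is blank, and uses made-up names (`src`, `vis`, `option_value`, `med_option`, and so on) plus the placeholder table `project.dataset.table`.

```sql
#standardSQL
WITH src AS (
  -- One row per person and benefit; a blank Name is assumed to mean the employee.
  SELECT
    IFNULL(Name, EmployeeName) AS person,
    BenefitCode,
    IFNULL(BenefitOption, BenefitCode) AS option_value
  FROM `project.dataset.table`
),
people AS (
  SELECT DISTINCT person FROM src
),
med AS (
  -- People with a MED row
  SELECT person, MAX(option_value) AS med_option
  FROM src WHERE BenefitCode = 'MED' GROUP BY person
),
den AS (
  -- People with a DEN row
  SELECT person, MAX(option_value) AS den_option
  FROM src WHERE BenefitCode = 'DEN' GROUP BY person
),
vis AS (
  -- People with a VIS row
  SELECT person, MAX(option_value) AS vis_option
  FROM src WHERE BenefitCode = 'VIS' GROUP BY person
),
joined AS (
  SELECT * FROM people
  LEFT JOIN med USING (person)
  LEFT JOIN den USING (person)
  LEFT JOIN vis USING (person)
)
SELECT person, med_option AS MED, den_option AS DEN, vis_option AS VIS
FROM joined
```

Because `person` here is just a name, two different people with the same name would be merged into one row; adding EmployeeName and Relationship to the join key is one way to reduce that risk.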

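Obviously we don't know your data, or even whether this is a real task, but there are some caveats you may need to watch out for that call for more detailed logic (people with the same name, multi-generational relationships, and so on).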
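
One of these caveats, dependents who share a name, can be checked with a quick query before pivoting. This is a partial check under assumptions: it reuses the column names from the question and the placeholder table `project.dataset.table`, and it only finds dependent names that appear under more than one employee.

```sql
#standardSQL
-- Dependent names listed under more than one employee (possible name collisions).
SELECT
  Name,
  COUNT(DISTINCT EmployeeName) AS employee_count
FROM `project.dataset.table`
WHERE Name IS NOT NULL
GROUP BY Name
HAVING COUNT(DISTINCT EmployeeName) > 1
```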