azure data lake u-sql pivot

Time: 2015-11-29 00:51:50

Tags: azure-data-lake u-sql

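I like Azure Data Lake, but the lack of documentation may slow adoption. I'm hoping someone here has more experience with U-SQL than I do.

I haven't had much luck working this out from what is available under Microsoft.Analytics.Interfaces and the U-SQL interpreter. Dynamic SQL that defines a rowset's schema at runtime does not seem to be supported, and the schema of an IUpdatableRow is read-only, so the processor approach is not viable. There is also no out-of-the-box PIVOT feature in U-SQL.

I also thought I might process the whole rowset together and write a custom outputter to do the pivot, but I couldn't figure it out.

There is probably a very simple way to do this, since it is a standard pivot operation. How would you efficiently reshape rowset I into rowset II, for an indeterminate number of ColA and ColB values?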

I

|ColA |ColB |ColC|
|1    |A    |30  |
|1    |B    |70  |
|1    |ZA   |12  |
|2    |C    |22  |
|2    |A    |13  |

II

|ID   |A    |B    |C   |...... |ZA   |.....
|1    |30   |70   |0   |       |12   |
|2    |13   |0    |22  |...... |0    |.....

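3 answers:

Answer 0 (score: 3)

You have a couple of options for doing this kind of PIVOT.

Here is one that uses the U-SQL MAP data type (called SQL.MAP). Instead of 0, it returns null for missing values (use a null-coalescing expression to turn them into 0). It works under the following conditions:

  1. The generated MAP stays within the 4 MB row size limit. If not, see the next solution.
  2. You know ahead of time which columns you have (if not, just keep the data in the map column and extract values as needed).

The MAP solution: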

    @t = SELECT *
         FROM(
            VALUES
            ( 1, "A", 30 ),
            ( 1, "B", 70 ),
            ( 1, "ZA", 12 ),
            ( 2, "C", 22 ),
            ( 2, "A", 13 ),
            ( 2, "ABC", 42)
         ) AS T(ColA, ColB, ColC);
    
    @m = SELECT ColA AS [ID],
                MAP_AGG(ColB, (int?) ColC) AS m
         FROM @t
         GROUP BY ColA;
    
    @r =
        SELECT [ID],
               m["A"] AS A,
               m["B"] AS B,
               m["C"] AS C,
               m["ZA"] AS [ZA],
               m["ABC"] AS [ABC]
        FROM @m;
    
    OUTPUT @r
    TO "/output/pivot1.csv"
    USING Outputters.Csv();
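
The null-coalescing step mentioned above can be sketched as follows. This is a minimal variant of the MAP script, not part of the original answer; the rowset name `@z` and the output path are hypothetical, and it assumes that indexing a SQL.MAP with a missing key yields null, as described above.

```usql
// Sketch only: same sample data and MAP_AGG step as the script above.
@t = SELECT *
     FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
     ) AS T(ColA, ColB, ColC);

@m = SELECT ColA AS [ID],
            MAP_AGG(ColB, (int?) ColC) AS m
     FROM @t
     GROUP BY ColA;

// m["key"] is null when the key is absent; ?? 0 replaces that null with 0.
@z =
    SELECT [ID],
           (m["A"] ?? 0) AS A,
           (m["B"] ?? 0) AS B,
           (m["C"] ?? 0) AS C,
           (m["ZA"] ?? 0) AS [ZA],
           (m["ABC"] ?? 0) AS [ABC]
    FROM @m;

// Hypothetical output path, chosen for illustration.
OUTPUT @z
TO "/output/pivot1_zeros.csv"
USING Outputters.Csv();
```

With the coalescing in place, the value columns are plain `int` rather than `int?`, which matches the 0-filled table II in the question.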
    

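Here is a solution that uses the standard SQL pivot pattern (some SQL database implementations used to translate PIVOT expressions into this kind of expression internally, and may still do so). Again, you have to know all the columns ahead of time. If that is not the case, just use the MAP data type.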

    @t =
        SELECT *
        FROM(
            VALUES
            ( 1, "A", 30 ),
            ( 1, "B", 70 ),
            ( 1, "ZA", 12 ),
            ( 2, "C", 22 ),
            ( 2, "A", 13 ),
            ( 2, "ABC", 42)
        ) AS T(ColA, ColB, ColC);
    
    @r =
        SELECT ColA AS [ID],
               (ColB == "A") ? ColC : 0 AS A,
               (ColB == "B") ? ColC : 0 AS B,
               (ColB == "C") ? ColC : 0 AS C,
               (ColB == "ZA") ? ColC : 0 AS [ZA],
               (ColB == "ABC") ? ColC : 0 AS [ABC]
        FROM @t;
    
    @r =
        SELECT DISTINCT [ID],
               LAST_VALUE(A) OVER(PARTITION BY [ID] ORDER BY A) AS A,
               LAST_VALUE(B) OVER(PARTITION BY [ID] ORDER BY B) AS B,
               LAST_VALUE(C) OVER(PARTITION BY [ID] ORDER BY C) AS C,
               LAST_VALUE([ZA]) OVER(PARTITION BY [ID] ORDER BY [ZA]) AS [ZA],
               LAST_VALUE([ABC]) OVER(PARTITION BY [ID] ORDER BY [ABC]) AS [ABC]
        FROM @r;
    
    OUTPUT @r
    TO "/output/pivot2.csv"
    USING Outputters.Csv();
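
The same pattern is often written as conditional aggregation in a single `GROUP BY` pass. The sketch below is a variant, not the answer's own script; the rowset name `@g` and the output path are hypothetical. It assumes `ColC` is never negative, so that `MAX` over "value or 0" picks the real value when one exists.

```usql
// Sketch only: conditional aggregation over the same sample data.
@t =
    SELECT *
    FROM(
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42)
    ) AS T(ColA, ColB, ColC);

// Each MAX sees ColC for rows whose ColB matches and 0 for all other rows,
// so an absent (ColA, ColB) pair produces 0 (assumes ColC >= 0).
@g =
    SELECT ColA AS [ID],
           MAX((ColB == "A") ? ColC : 0) AS A,
           MAX((ColB == "B") ? ColC : 0) AS B,
           MAX((ColB == "C") ? ColC : 0) AS C,
           MAX((ColB == "ZA") ? ColC : 0) AS [ZA],
           MAX((ColB == "ABC") ? ColC : 0) AS [ABC]
    FROM @t
    GROUP BY ColA;

// Hypothetical output path, chosen for illustration.
OUTPUT @g
TO "/output/pivot2_groupby.csv"
USING Outputters.Csv();
```

If a (ColA, ColB) pair can occur more than once, the choice of aggregate (`MAX`, `SUM`, and so on) decides how the duplicates are combined, so pick it to match the intended semantics.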
    

Answer 1 (score: 3)

As of March 2017, PIVOT / UNPIVOT syntax has been added to U-SQL.

Using the sample data above:

    @t = SELECT *
         FROM(
            VALUES
            ( 1, "A", 30 ),
            ( 1, "B", 70 ),
            ( 1, "ZA", 12 ),
            ( 2, "C", 22 ),
            ( 2, "A", 13 ),
            ( 2, "ABC", 42)
         ) AS T(ColA, ColB, ColC);

    // After the pivot, the rowset has ColA plus one column per alias in the IN list.
    @p =
        SELECT ColA AS [ID], [A], [B], [C], [ZA], [ABC]
        FROM @t
             PIVOT (MAX(ColC) FOR ColB IN ("A" AS [A], "B" AS [B], "C" AS [C], "ZA" AS [ZA], "ABC" AS [ABC])
                   ) AS pvt;

    OUTPUT @p
    TO "/output/pivot3.csv"
    USING Outputters.Csv();

Answer 2 (score: 0)

Here is a workaround one of my team members came up with, where we know some of the columns ahead of time.

    @t = SELECT *
         FROM(
            VALUES
            ( 1, "A", 30 ),
            ( 1, "B", 70 ),
            ( 1, "ZA", 12 ),
            ( 2, "C", 22 ),
            ( 2, "A", 13 ),
            ( 2, "ABC", 42)
         ) AS T(ColA, ColB, ColC);

    @t1 =
        SELECT DISTINCT ColB
        FROM @t
        ORDER BY ColB DESC
        OFFSET 0 ROW;

    @t1 =
        SELECT ARRAY_AGG(ColB) AS ColBArray
        FROM @t1;

    @result =
        SELECT ColA,
               MAP_AGG(ColB, (int?) ColC) AS ColCMap
        FROM @t
        GROUP BY ColA;

    @result =
        SELECT a.ColA,
               DPivotNS.DPivot.FillGapsAndConvert(a.ColCMap, b.ColBArray) AS Values
        FROM @result AS a
             CROSS JOIN
                 @t1 AS b;

    @result =
        SELECT ColA,
               ArrayColumn
        FROM
        (
            SELECT 0 AS ColA,
                   ColBArray AS ArrayColumn,
                   0 AS Ord
            FROM @t1
            UNION ALL
            SELECT ColA AS ColA,
                   Values AS ArrayColumn,
                   1 AS Ord
            FROM @result
        ) AS rs1
        ORDER BY rs1.Ord
        OFFSET 0 ROWS;

    @result =
        SELECT ColA,
               String.Join(",", ArrayColumn) AS Values
        FROM @result;

    OUTPUT @result
    TO "result.csv"
    USING Outputters.Csv(quoting:false);

Here is the UDF used by the script above:

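    using System.Collections.Generic;
    using Microsoft.Analytics.Types.Sql;

    // Namespace and class names follow the call DPivotNS.DPivot.FillGapsAndConvert in the script.
    namespace DPivotNS
    {
        public class DPivot
        {
            // For each column name in ColDArray, emit the mapped value as a string, or "0" if the key is missing.
            public static SqlArray<string> FillGapsAndConvert(SqlMap<string, int?> ColCMap, SqlArray<string> ColDArray)
            {
                var list = new LinkedList<string>();
                foreach (string colD in ColDArray)
                {
                    int? currentCount = ColCMap[colD];
                    int newCount = currentCount.HasValue ? currentCount.Value : 0;
                    list.AddLast(newCount.ToString());
                }
                return new SqlArray<string>(list);
            }
        }
    }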