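I love Azure Data Lake, but the lack of documentation will probably slow adoption. I'm hoping someone out there has more experience with U-SQL than I do.
I've tried to glean what I can from what is available under Microsoft.Analytics.Interfaces and from the U-SQL interpreter, without much luck. There appears to be no support for dynamic SQL that would define a rowset's schema at run time, and the schema of IUpdatableRow is read-only, so the processor approach isn't viable. There is also no out-of-the-box PIVOT function in U-SQL.
I also thought I might process the whole rowset together and write a custom outputter to do the pivot, but I couldn't figure it out.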
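There is probably a really simple way to do this, since it is a standard pivot operation. How would you efficiently reshape a rowset of ColA, ColB and ColC values, like the first table below, into one row per ColA value with an indeterminate number of columns, one per distinct ColB value, like the second table?

| ColA | ColB | ColC |
| 1    | A    | 30   |
| 1    | B    | 70   |
| 1    | ZA   | 12   |
| 2    | C    | 22   |
| 2    | A    | 13   |

| ID | A  | B  | C  | ... | ZA | ... |
| 1  | 30 | 70 | 0  | ... | 12 | ... |
| 2  | 13 | 0  | 22 | ... | 0  | ... |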
Answer 0 (score: 3)
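You have a couple of options for doing this kind of PIVOT.
Here is one that uses the U-SQL MAP data type (called SQL.MAP). Instead of 0 it returns null for missing values (use a null-coalescing expression to turn that into 0). As written, it requires every output column to be named explicitly in the final SELECT.
MAP solution: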
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);

@m =
    SELECT ColA AS [ID],
           MAP_AGG(ColB, (int?) ColC) AS m
    FROM @t
    GROUP BY ColA;

@r =
    SELECT [ID],
           m["A"] AS A,
           m["B"] AS B,
           m["C"] AS C,
           m["ZA"] AS [ZA],
           m["ABC"] AS [ABC]
    FROM @m;

OUTPUT @r
TO "/output/pivot1.csv"
USING Outputters.Csv();
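To get 0 rather than null for a missing key, each lookup can be wrapped in a C# null-coalescing expression. The following is a minimal, untested sketch. It assumes the same sample data, and it assumes that indexing the map with an absent key yields null, as described above. The output path /output/pivot1_zeros.csv is only illustrative.

```usql
// Same sample data as in the MAP solution above.
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);

// One map per ColA value, from ColB to ColC.
@m =
    SELECT ColA AS [ID],
           MAP_AGG(ColB, (int?) ColC) AS m
    FROM @t
    GROUP BY ColA;

// Assumption: a key that is absent from the map yields null, so ?? supplies the 0.
@r =
    SELECT [ID],
           (m["A"] ?? 0) AS A,
           (m["B"] ?? 0) AS B,
           (m["C"] ?? 0) AS C,
           (m["ZA"] ?? 0) AS [ZA],
           (m["ABC"] ?? 0) AS [ABC]
    FROM @m;

OUTPUT @r
TO "/output/pivot1_zeros.csv"
USING Outputters.Csv();
```

The column list is still fixed at compile time; only the handling of missing values changes.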
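Here is a solution that uses the standard SQL pivot pattern (some SQL database implementations used to translate PIVOT expressions into this form internally, and may still do so). Again, you have to know all the columns ahead of time. If that is not the case, just use the MAP data type.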
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);

@r =
    SELECT ColA AS [ID],
           (ColB == "A") ? ColC : 0 AS A,
           (ColB == "B") ? ColC : 0 AS B,
           (ColB == "C") ? ColC : 0 AS C,
           (ColB == "ZA") ? ColC : 0 AS [ZA],
           (ColB == "ABC") ? ColC : 0 AS [ABC]
    FROM @t;

@r =
    SELECT DISTINCT [ID],
           LAST_VALUE(A) OVER(PARTITION BY [ID] ORDER BY A) AS A,
           LAST_VALUE(B) OVER(PARTITION BY [ID] ORDER BY B) AS B,
           LAST_VALUE(C) OVER(PARTITION BY [ID] ORDER BY C) AS C,
           LAST_VALUE([ZA]) OVER(PARTITION BY [ID] ORDER BY [ZA]) AS [ZA],
           LAST_VALUE([ABC]) OVER(PARTITION BY [ID] ORDER BY [ABC]) AS [ABC]
    FROM @r;

OUTPUT @r
TO "/output/pivot2.csv"
USING Outputters.Csv();
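The same pattern can probably be written more compactly by aggregating the conditional expressions directly under a GROUP BY, rather than going through window functions and DISTINCT. This is an untested sketch and not part of the original answer. It assumes each (ColA, ColB) pair occurs at most once, so SUM simply picks up the single matching value. The output path /output/pivot2_groupby.csv is illustrative.

```usql
// Same sample data as above.
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);

// Conditional aggregation: each output column sums ColC only over the rows
// whose ColB matches that column; non-matching rows contribute 0.
@r =
    SELECT ColA AS [ID],
           SUM((ColB == "A") ? ColC : 0) AS A,
           SUM((ColB == "B") ? ColC : 0) AS B,
           SUM((ColB == "C") ? ColC : 0) AS C,
           SUM((ColB == "ZA") ? ColC : 0) AS [ZA],
           SUM((ColB == "ABC") ? ColC : 0) AS [ABC]
    FROM @t
    GROUP BY ColA;

OUTPUT @r
TO "/output/pivot2_groupby.csv"
USING Outputters.Csv();
```

If a (ColA, ColB) pair can repeat, SUM adds the values together; MAX would be the closer match if the largest value per cell is wanted.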
Answer 1 (score: 3)
As of March 2017, PIVOT / UNPIVOT syntax has been added to U-SQL.
Using the sample data above:
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);
@p =
    SELECT *    // result columns: ColA, A, B, C, ZA, ABC
    FROM @t
         PIVOT (MAX(ColC) FOR ColB IN ("A" AS [A], "B" AS [B], "C" AS [C], "ZA" AS [ZA], "ABC" AS [ABC])) AS pvt;
OUTPUT @p
TO "/output/pivot3.csv"
USING Outputters.Csv();
Answer 2 (score: 0)
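Below is a workaround that a member of my team came up with, for the case where only some of the columns are known ahead of time. It collects the distinct ColB values at run time and writes them out as a header row, followed by one comma-joined row of values per ColA.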
@t =
    SELECT *
    FROM (
        VALUES
        ( 1, "A", 30 ),
        ( 1, "B", 70 ),
        ( 1, "ZA", 12 ),
        ( 2, "C", 22 ),
        ( 2, "A", 13 ),
        ( 2, "ABC", 42 )
    ) AS T(ColA, ColB, ColC);

@t1 =
    SELECT DISTINCT ColB
    FROM @t
    ORDER BY ColB DESC
    OFFSET 0 ROW;

@t1 =
    SELECT ARRAY_AGG(ColB) AS ColBArray
    FROM @t1;

@result =
    SELECT ColA,
           MAP_AGG(ColB, (int?) ColC) AS ColCMap
    FROM @t
    GROUP BY ColA;

@result =
    SELECT a.ColA,
           DPivotNS.DPivot.FillGapsAndConvert(a.ColCMap, b.ColBArray) AS Values
    FROM @result AS a
         CROSS JOIN
         @t1 AS b;

@result =
    SELECT ColA,
           ArrayColumn
    FROM
    (
        SELECT 0 AS ColA,
               ColBArray AS ArrayColumn,
               0 AS Ord
        FROM @t1
        UNION ALL
        SELECT ColA AS ColA,
               Values AS ArrayColumn,
               1 AS Ord
        FROM @result
    ) AS rs1
    ORDER BY rs1.Ord
    OFFSET 0 ROWS;

@result =
    SELECT ColA,
           String.Join(",", ArrayColumn) AS Values
    FROM @result;

OUTPUT @result
TO "result.csv"
USING Outputters.Csv(quoting:false);
Here is the UDF used by the script above:
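using System.Collections.Generic;
using Microsoft.Analytics.Types.Sql;

namespace DPivotNS
{
    public static class DPivot
    {
        // For each column name in ColDArray, emits the mapped count as a string, or "0" when the map has no value for it.
        public static SqlArray<string> FillGapsAndConvert(SqlMap<string, int?> ColCMap, SqlArray<string> ColDArray)
        {
            var list = new LinkedList<string>();
            foreach (string colD in ColDArray)
            {
                int? currentCount = ColCMap[colD];
                list.AddLast((currentCount ?? 0).ToString());
            }
            return new SqlArray<string>(list);
        }
    }
}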