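I'm struggling to work out how to write "multi-row" formulas in U-SQL. My data is ordered by date, and for each row I want to find the first preceding value of "Port" that is not equal to the current row's value. In a similar way, I want to find the start of the current run of rows with the same port value, so I can calculate how many days the vessel has been in port. Bear in mind that this must only cover rows with the same port name, with no new/other ports in between.
I'm loading my data like this: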
@res = SELECT
Port,
Date
FROM @data;
This is how my data is structured:
Port | Date |
Port A | 1/1/2017 |
Port A | 1/1/2017 |
Port A | 1/2/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/5/2017 |
Port B | 1/6/2017 |
Port C | 1/9/2017 |
Port C | 1/10/2017 |
Port C | 1/11/2017 |
Port A | 1/14/2017 |
Port A | 1/15/2017 |
This is how I would like the data to be structured:
Port | Date | Time in Port | Previous Port
Port A | 1/1/2017 | 0 | N/A
Port A | 1/1/2017 | 0 | N/A
Port A | 1/2/2017 | 1 | N/A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/5/2017 | 1 | Port A
Port B | 1/6/2017 | 2 | Port A
Port C | 1/9/2017 | 0 | Port B
Port C | 1/10/2017 | 1 | Port B
Port C | 1/11/2017 | 2 | Port B
Port A | 1/14/2017 | 0 | Port C
Port A | 1/15/2017 | 1 | Port C
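I'm new to U-SQL, so I'm having some trouble working out how to solve this. My first instinct is to use some combination of LEAD()/LAG() and ROW_NUMBER() OVER(PARTITION BY xx ORDER BY Date), but I'm not sure how to get the exact effect I'm looking for.
Could anyone point me in the right direction?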
Answer 0 (score: 1)
You can do what you need with the LAG analytic function, ranking functions such as DENSE_RANK, and the OVER clause:
// Test data
@input = SELECT *
FROM (
VALUES
( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/2/2017", new CultureInfo("en-US") ), 1 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/5/2017", new CultureInfo("en-US") ), 1 ),
( "Port B", DateTime.Parse("1/6/2017", new CultureInfo("en-US") ), 2 ),
( "Port C", DateTime.Parse("1/9/2017", new CultureInfo("en-US") ), 0 ),
( "Port C", DateTime.Parse("1/10/2017", new CultureInfo("en-US") ), 1 ),
( "Port C", DateTime.Parse("1/11/2017", new CultureInfo("en-US") ), 2 ),
( "Port A", DateTime.Parse("1/14/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/15/2017", new CultureInfo("en-US") ), 1 )
) AS x ( Port, Date, timeInPort );
// Add a group id to the dataset
@working =
SELECT Port,
Date,
timeInPort,
DENSE_RANK() OVER(ORDER BY Date) - DENSE_RANK() OVER(PARTITION BY Port ORDER BY Date) AS groupId
FROM @input;
// Use the group id to work out the datediff with previous row
@working =
SELECT Port,
Date,
timeInPort,
groupId,
Date.Date.Subtract((DateTime)(LAG(Date) OVER(PARTITION BY groupId ORDER BY Date) ?? Date)).TotalDays AS diff // datediff
FROM @working;
// Work out the previous port, based on group id
@ports =
SELECT Port, groupId
FROM @working
GROUP BY Port, groupId;
@ports =
SELECT Port, groupId, LAG(Port) OVER( ORDER BY groupId ) AS previousPort
FROM @ports;
// Prep the final output
@output =
SELECT w.Port,
w.Date.ToString("M/d/yyyy") AS Date,
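SUM(w.diff) OVER( PARTITION BY w.groupId ORDER BY w.Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS timeInPort, // running total; a frame of 1 PRECEDING only covers stays spanning at most three distinct dates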
p.previousPort
FROM @working AS w
INNER JOIN
@ports AS p
ON w.Port == p.Port
AND w.groupId == p.groupId;
OUTPUT @output TO "/output/output.csv"
ORDER BY Date, Port
USING Outputters.Csv(quoting:false);
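The approach is not entirely straightforward. This simple rig works for your test data; I would suggest testing it thoroughly with a larger and more complex dataset.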
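For testing against larger datasets, it may help to have an independent reference to compare the U-SQL output with. The following is a minimal sketch in plain Python (not U-SQL) of the same "consecutive run" logic, assuming the rows are already sorted by date. The function name `time_in_port` and the use of `None` for "N/A" are illustrative choices, not part of the original script.

```python
from datetime import date
from itertools import groupby


def time_in_port(rows):
    """rows: list of (port, date) tuples, already sorted by date.

    Returns a list of (port, date, days_in_port, previous_port) tuples.
    previous_port is None for the first stay (shown as N/A in the question).
    """
    result = []
    previous_port = None
    # groupby splits the sorted rows into runs of consecutive equal ports,
    # so a later return to "Port A" starts a new run.
    for port, run in groupby(rows, key=lambda r: r[0]):
        run = list(run)
        arrival = run[0][1]  # first date of this stay
        for _, day in run:
            result.append((port, day, (day - arrival).days, previous_port))
        previous_port = port
    return result


# The sample data from the question
rows = [
    ("Port A", date(2017, 1, 1)),
    ("Port A", date(2017, 1, 1)),
    ("Port A", date(2017, 1, 2)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 5)),
    ("Port B", date(2017, 1, 6)),
    ("Port C", date(2017, 1, 9)),
    ("Port C", date(2017, 1, 10)),
    ("Port C", date(2017, 1, 11)),
    ("Port A", date(2017, 1, 14)),
    ("Port A", date(2017, 1, 15)),
]

result = time_in_port(rows)
for row in result:
    print(row)
```

Because this version measures each row against the arrival date of its stay rather than summing row-to-row differences, it does not depend on how rows with the same date happen to be ordered, which makes it a convenient baseline when comparing against the windowed U-SQL result.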
My results from the U-SQL script: