U-SQL: how do I select the first value in a column that differs from the current row?

Time: 2017-12-06 08:50:04

Tags: row-number u-sql multirow

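I am struggling to work out how to write a "multi-row" formula in U-SQL. My data is ordered by date, and for each row I want to find the first value of "Port" that is not equal to the current row's value. In a similar way, I want to find the last row, by date, that has the current port value, so that I can calculate how many days a vessel has been in port. Keep in mind that these have to be rows with the same port name, with no new/other ports in between.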

I am loading my data like this:

@res = SELECT
        Port,
        Date
        FROM @data;

This is how my data is structured:

Port      |   Date       |
Port A    |   1/1/2017   |
Port A    |   1/1/2017   |
Port A    |   1/2/2017   |
Port B    |   1/4/2017   |
Port B    |   1/4/2017   |
Port B    |   1/4/2017   |
Port B    |   1/5/2017   |
Port B    |   1/6/2017   |
Port C    |   1/9/2017   |
Port C    |   1/10/2017  |
Port C    |   1/11/2017  |
Port A    |   1/14/2017  |
Port A    |   1/15/2017  |

This is how I would like the data to be structured:

Port      |   Date       |  Time in Port   | Previous Port
Port A    |   1/1/2017   |      0          |   N/A
Port A    |   1/1/2017   |      0          |   N/A
Port A    |   1/2/2017   |      1          |   N/A
Port B    |   1/4/2017   |      0          |   Port A
Port B    |   1/4/2017   |      0          |   Port A
Port B    |   1/4/2017   |      0          |   Port A
Port B    |   1/5/2017   |      1          |   Port A
Port B    |   1/6/2017   |      2          |   Port A
Port C    |   1/9/2017   |      0          |   Port B
Port C    |   1/10/2017  |      1          |   Port B
Port C    |   1/11/2017  |      2          |   Port B
Port A    |   1/14/2017  |      0          |   Port C
Port A    |   1/15/2017  |      1          |   Port C

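I am new to U-SQL, so I am having some trouble working out how to approach this. My first instinct is to use some combination of LEAD()/LAG() and ROW_NUMBER() OVER(PARTITION BY xx ORDER BY Date), but I am not sure how to get the exact effect I am looking for.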

Can anyone point me in the right direction?

1 answer:

Answer 0: (score: 1)

You can do what you need with the so-called Ranking and Analytic functions, such as DENSE_RANK and LAG, together with the OVER clause, although it is not entirely straightforward. This simple rig works for your test data; I would recommend testing thoroughly with a larger and more complex dataset.

// Test data
@input =
    SELECT *
    FROM (
        VALUES
        ( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/2/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/5/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port B", DateTime.Parse("1/6/2017", new CultureInfo("en-US") ), 2 ),
        ( "Port C", DateTime.Parse("1/9/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port C", DateTime.Parse("1/10/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port C", DateTime.Parse("1/11/2017", new CultureInfo("en-US") ), 2 ),
        ( "Port A", DateTime.Parse("1/14/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/15/2017", new CultureInfo("en-US") ), 1 )
    ) AS x ( Port, Date, timeInPort );

// Add a group id to the dataset
@working =
    SELECT Port,
           Date,
           timeInPort,
           DENSE_RANK() OVER(ORDER BY Date) - DENSE_RANK() OVER(PARTITION BY Port ORDER BY Date) AS groupId
    FROM @input;

// Use the group id to work out the datediff with previous row
@working =
    SELECT Port,
           Date,
           timeInPort,
           groupId,
           Date.Date.Subtract((DateTime)(LAG(Date) OVER(PARTITION BY groupId ORDER BY Date) ?? Date)).TotalDays AS diff // datediff
    FROM @working;

// Work out the previous port, based on group id
@ports =
    SELECT Port,
           groupId
    FROM @working
    GROUP BY Port, groupId;

@ports =
    SELECT Port,
           groupId,
           LAG(Port) OVER( ORDER BY groupId ) AS previousPort
    FROM @ports;

// Prep the final output
// Note: the frame starts at UNBOUNDED PRECEDING so that diff accumulates over the
// whole stay; a frame of 1 PRECEDING only sums the current and previous row.
@output =
    SELECT w.Port,
           w.Date.ToString("M/d/yyyy") AS Date,
           SUM(w.diff) OVER( PARTITION BY w.groupId ORDER BY w.Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS timeInPort,
           p.previousPort
    FROM @working AS w
         INNER JOIN
             @ports AS p
         ON w.Port == p.Port
            AND w.groupId == p.groupId;

OUTPUT @output TO "/output/output.csv"
ORDER BY Date, Port
USING Outputters.Csv(quoting:false);

My results:

[Screenshot of the output not preserved.]
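
The group id step is the least obvious part of the answer, so here is a minimal sketch of the same idea in plain Python, intended only as a way to check the logic by hand without submitting a U-SQL job. It is not part of the original answer. The names `ROWS`, `dense_rank`, `group_ids` and `summarize` are made up for illustration, and days in port are computed as the distance from the first date of each run rather than as a running sum of day differences, which should be equivalent for date-ordered data like the sample.

```python
from datetime import date

# Sample rows from the question, in date order: (port, date).
ROWS = [
    ("Port A", date(2017, 1, 1)),
    ("Port A", date(2017, 1, 1)),
    ("Port A", date(2017, 1, 2)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 4)),
    ("Port B", date(2017, 1, 5)),
    ("Port B", date(2017, 1, 6)),
    ("Port C", date(2017, 1, 9)),
    ("Port C", date(2017, 1, 10)),
    ("Port C", date(2017, 1, 11)),
    ("Port A", date(2017, 1, 14)),
    ("Port A", date(2017, 1, 15)),
]


def dense_rank(values):
    """Map each distinct value to its 1-based dense rank in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def group_ids(rows):
    """Dense rank of the date overall minus its dense rank within the port.

    The difference stays constant while the port does not change and jumps
    when the vessel comes back to a port it has visited before.
    """
    overall = dense_rank(d for _, d in rows)
    per_port = {
        port: dense_rank(d for p, d in rows if p == port)
        for port in {p for p, _ in rows}
    }
    return [overall[d] - per_port[p][d] for p, d in rows]


def summarize(rows):
    """Return (port, date, days_in_port, previous_port) for every row."""
    ids = group_ids(rows)

    # First date of each (port, group id) run.
    first_day = {}
    for (port, day), gid in zip(rows, ids):
        key = (port, gid)
        first_day[key] = min(first_day.get(key, day), day)

    # Order the runs by group id, as LAG(Port) OVER(ORDER BY groupId) does.
    runs = sorted(first_day, key=lambda key: key[1])
    previous = {
        key: (runs[i - 1][0] if i > 0 else None) for i, key in enumerate(runs)
    }

    return [
        (port, day, (day - first_day[(port, gid)]).days, previous[(port, gid)])
        for (port, day), gid in zip(rows, ids)
    ]


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row)
```

For the sample data the group ids come out as 0 for the first Port A stay, 2 for Port B, 5 for Port C and 6 for the second Port A stay, which is why ordering the runs by group id recovers the visiting order. That ordering is a property of this dataset rather than a general guarantee, so the advice to test with larger and more complex data applies here too.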