Define the EXTRACT range from a SELECT statement

Time: 2018-01-29 22:22:19

Tags: azure-data-lake u-sql

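I intend to batch-process a dataset from EventHub that is stored in ADLA. It seems logical to me to process it in intervals, where the date falls between my last execution datetime and the current execution datetime.

I thought about saving the execution timestamps in a table so that I can keep track of them, and then doing the following: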

DECLARE @my_file string = @"/data/raw/my-ns/my-eh/{date:yyyy}/{date:MM}/{date:dd}/{date:HH}/{date:mm}/{date:ss}/{*}.avro";

DECLARE @max_datetime DateTime =  DateTime.Now;

@min_datetime =
    SELECT (DateTime) MAX(execution_datetime) AS min_datetime
    FROM my_adldb.dbo.watermark;

@my_json_bytes =
    EXTRACT Body byte[],
            date DateTime
    FROM @my_file
    USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"{""type"":""record"",""name"":""EventData"",""namespace"":""Microsoft.ServiceBus.Messaging"",""fields"":[{""name"":""SequenceNumber"",""type"":""long""},{""name"":""Offset"",""type"":""string""},{""name"":""EnqueuedTimeUtc"",""type"":""string""},{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes"",""null""]}},{""name"":""Body"",""type"":[""null"",""bytes""]}]}");

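How do I properly add this interval to the EXTRACT query? I tested it with an ordinary WHERE clause and a manually defined interval, and that worked, but it fails when I try to use @min_datetime, because @min_datetime is a rowset rather than a scalar value.

I thought about applying the filter in a subsequent query, but I am worried that @my_json_bytes would then extract my whole dataset and only filter it afterwards, resulting in a suboptimal query.

Thanks in advance.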

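1 Answer:

Answer 0: (score: 1)

You should be able to apply the filter as part of a later SELECT. U-SQL can push predicates up into the EXTRACT under certain conditions, but I have not been able to test it yet. Try something like this: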

@min_datetime =
    SELECT (DateTime) MAX(execution_datetime) AS min_datetime
    FROM my_adldb.dbo.watermark;

@my_json_bytes =
    EXTRACT Body byte[],
            date DateTime
    FROM @my_file
    USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"{""type"":""record"",""name"":""EventData"",""namespace"":""Microsoft.ServiceBus.Messaging"",""fields"":[{""name"":""SequenceNumber"",""type"":""long""},{""name"":""Offset"",""type"":""string""},{""name"":""EnqueuedTimeUtc"",""type"":""string""},{""name"":""SystemProperties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes""]}},{""name"":""Properties"",""type"":{""type"":""map"",""values"":[""long"",""double"",""string"",""bytes"",""null""]}},{""name"":""Body"",""type"":[""null"",""bytes""]}]}");

@working =
    SELECT *
    FROM @my_json_bytes AS j
         CROSS JOIN
             @min_datetime AS t
    WHERE j.date > t.min_datetime;
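
If the join above turns out to read every file, a possible alternative is to pass the watermark into the script as a scalar parameter, so that the filter on the virtual column date compares against a script variable instead of a rowset. The following is only a sketch under assumptions that are not in the question: the caller (for example a scheduler) is assumed to override @min_datetime at submission time, the default value shown is a made-up placeholder, and @my_json_bytes and @max_datetime are assumed to be defined as in the question. Whether files outside the range are actually skipped should be checked in the job graph.

```usql
// Hypothetical parameter: DECLARE EXTERNAL lets the caller override this default.
// The default date below is a placeholder, not a value from the question.
DECLARE EXTERNAL @min_datetime DateTime = new DateTime(2018, 1, 29, 0, 0, 0);

// Filter on the virtual column "date" using scalar variables only.
// Assumes @my_json_bytes and @max_datetime are declared as in the question.
@working =
    SELECT Body,
           date
    FROM @my_json_bytes
    WHERE date > @min_datetime AND date <= @max_datetime;
```

With this layout the watermark table can still be kept, but it would be read by whatever submits the job rather than inside the script itself.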