Adding a new column in U-SQL populated with a value from a specific row

Time: 2017-07-11 06:13:25

Tags: c# azure azure-data-lake u-sql

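I need to extract records from a text file in U-SQL. The first line differs from the rest: it contains a date. I have to skip that first line, but I also have to copy the date from it into a new column on every row. In my final U-SQL output, the first column of each row should therefore hold the same value, taken from the first line of the file. The original post attached an image illustrating this.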


Please suggest the correct U-SQL query to accomplish this.

2 answers:

Answer 0 (score: 3)

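Here is another way. I used SearchLog.tsv from the Samples to demonstrate it (the script below reads the modified copy as SearchLogWithHeader.tsv). At the top of the file I added the line FROM 01JAN17 TO 31JAN17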

//Skip the first row and read all the other rows
@searchlog = 
    EXTRACT UserId          int, 
            Start           DateTime, 
            Region          string, 
            Query           string, 
            Duration        int, 
            Urls            string, 
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLogWithHeader.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);

//Extract all the text in the same file but don't parse out the individual columns
@searchlogAllText = 
    EXTRACT rowText string
    FROM @"/Samples/Data/SearchLogWithHeader.tsv"
    USING Extractors.Text(delimiter: '\n');

//Find a pattern that works for you and use the .NET expressions that match the string
@searchlogHeaderDate = 
    SELECT rowText.Split(' ')[1] AS FromDate FROM @searchlogAllText WHERE rowText.StartsWith("FROM");                                    

@output = SELECT * FROM @searchlogHeaderDate CROSS JOIN @searchlog;

OUTPUT @output 
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();

Input:

FROM 01JAN17 TO 31JAN17
399266  2/15/2012 11:53:16 AM   en-us   how to make nachos  73  www.nachos.com;www.wikipedia.com    NULL
382045  2/15/2012 11:53:18 AM   en-gb   best ski resorts    614 skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm   ski-europe.com;www.travelersdigest.com/ski_resorts.htm
382045  2/16/2012 11:53:20 AM   en-gb   broken leg  74  mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture   mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture
106479  2/16/2012 11:53:50 AM   en-ca   south park episodes 24  southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall   southparkstudios.com
906441  2/16/2012 11:54:01 AM   en-us   cosmos  1213    cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos NULL
351530  2/16/2012 11:54:01 AM   en-fr   microsoft   241 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
640806  2/16/2012 11:54:02 AM   en-us   wireless headphones 502 www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com www.amazon.com;store.apple.com
304305  2/16/2012 11:54:03 AM   en-us   dominos pizza   60  dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos  dominos.com
460748  2/16/2012 11:54:04 AM   en-us   yelp    1270    yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp  yelp.com
354841  2/16/2012 11:59:01 AM   en-us   how to run  610 running.about.com;ehow.com;go.com   running.about.com;ehow.com
354068  2/16/2012 12:00:33 PM   en-mx   what is sql 422 wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL    wikipedia.org/wiki/SQL
674364  2/16/2012 12:00:55 PM   en-us   mexican food redmond    283 eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com  NULL
347413  2/16/2012 12:11:55 PM   en-gr   microsoft   305 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
848434  2/16/2012 12:12:35 PM   en-ch   facebook    10  facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook facebook.com
604846  2/16/2012 12:13:55 PM   en-us   wikipedia   612 wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia  wikipedia.org
840614  2/16/2012 12:13:56 PM   en-us   xbox    1220    xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360    xbox.com/xbox360
656666  2/16/2012 12:15:55 PM   en-us   hotmail 691 hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail    NULL
951513  2/16/2012 12:17:00 PM   en-us   pokemon 63  pokemon.com;pokemon.com/us;serebii.net  pokemon.com
350350  2/16/2012 12:18:17 PM   en-us   wolfram 30  wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram    NULL
641615  2/16/2012 12:19:55 PM   en-us   kahn    119 khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name) khanacademy.org
321065  2/16/2012 12:20:03 PM   en-us   clothes 732 gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars   footballfanatics.com/college_washington_state_cougars
651777  2/16/2012 12:20:33 PM   en-us   food recipes    183 allrecipes.com;foodnetwork.com;simplyrecipes.com    foodnetwork.com
666352  2/16/2012 12:21:03 PM   en-us   weight loss 630 en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com webmd.com/diet

Output:

"01JAN17"   399266  2012-02-15T11:53:16.0000000 "en-us" "how to make nachos"    73  "www.nachos.com;www.wikipedia.com"  "NULL"
"01JAN17"   382045  2012-02-15T11:53:18.0000000 "en-gb" "best ski resorts"  614 "skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm" "ski-europe.com;www.travelersdigest.com/ski_resorts.htm"
"01JAN17"   382045  2012-02-16T11:53:20.0000000 "en-gb" "broken leg"    74  "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture" "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture"
"01JAN17"   106479  2012-02-16T11:53:50.0000000 "en-ca" "south park episodes"   24  "southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall" "southparkstudios.com"
"01JAN17"   906441  2012-02-16T11:54:01.0000000 "en-us" "cosmos"    1213    "cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos"   "NULL"
"01JAN17"   351530  2012-02-16T11:54:01.0000000 "en-fr" "microsoft" 241 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com"   "NULL"
"01JAN17"   640806  2012-02-16T11:54:02.0000000 "en-us" "wireless headphones"   502 "www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com"   "www.amazon.com;store.apple.com"
"01JAN17"   304305  2012-02-16T11:54:03.0000000 "en-us" "dominos pizza" 60  "dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos"    "dominos.com"
"01JAN17"   460748  2012-02-16T11:54:04.0000000 "en-us" "yelp"  1270    "yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp"    "yelp.com"
"01JAN17"   354841  2012-02-16T11:59:01.0000000 "en-us" "how to run"    610 "running.about.com;ehow.com;go.com" "running.about.com;ehow.com"
"01JAN17"   354068  2012-02-16T12:00:33.0000000 "en-mx" "what is sql"   422 "wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL"  "wikipedia.org/wiki/SQL"
"01JAN17"   674364  2012-02-16T12:00:55.0000000 "en-us" "mexican food redmond"  283 "eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com"    "NULL"
"01JAN17"   347413  2012-02-16T12:11:55.0000000 "en-gr" "microsoft" 305 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com"   "NULL"
"01JAN17"   848434  2012-02-16T12:12:35.0000000 "en-ch" "facebook"  10  "facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook"   "facebook.com"
"01JAN17"   604846  2012-02-16T12:13:55.0000000 "en-us" "wikipedia" 612 "wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia"    "wikipedia.org"
"01JAN17"   840614  2012-02-16T12:13:56.0000000 "en-us" "xbox"  1220    "xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360"  "xbox.com/xbox360"
"01JAN17"   656666  2012-02-16T12:15:55.0000000 "en-us" "hotmail"   691 "hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail"  "NULL"
"01JAN17"   951513  2012-02-16T12:17:00.0000000 "en-us" "pokemon"   63  "pokemon.com;pokemon.com/us;serebii.net"    "pokemon.com"
"01JAN17"   350350  2012-02-16T12:18:17.0000000 "en-us" "wolfram"   30  "wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram"  "NULL"
"01JAN17"   641615  2012-02-16T12:19:55.0000000 "en-us" "kahn"  119 "khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name)"   "khanacademy.org"
"01JAN17"   321065  2012-02-16T12:20:03.0000000 "en-us" "clothes"   732 "gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars" "footballfanatics.com/college_washington_state_cougars"
"01JAN17"   651777  2012-02-16T12:20:33.0000000 "en-us" "food recipes"  183 "allrecipes.com;foodnetwork.com;simplyrecipes.com"  "foodnetwork.com"
"01JAN17"   666352  2012-02-16T12:21:03.0000000 "en-us" "weight loss"   630 "en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com"   "webmd.com/diet"

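A possible refinement of the answer above, offered as a sketch rather than something tested against this file: if the header date should be a real DateTime instead of the string "01JAN17", the token can be parsed with .NET's DateTime.ParseExact. The format string "ddMMMyy" is my assumption about the header layout, and the rowset names and output path are made up for illustration.

```
// Read every line as a single string column (same trick as in the answer above).
@allText =
    EXTRACT rowText string
    FROM @"/Samples/Data/SearchLogWithHeader.tsv"
    USING Extractors.Text(delimiter: '\n');

// Keep only the header line and parse its second token ("01JAN17") as a date.
// "ddMMMyy" is an assumed format: two-digit day, abbreviated month, two-digit year.
@headerDate =
    SELECT DateTime.ParseExact(rowText.Split(' ')[1],
                               "ddMMMyy",
                               System.Globalization.CultureInfo.InvariantCulture) AS FromDate
    FROM @allText
    WHERE rowText.StartsWith("FROM");

// Hypothetical output path, only here so the script can run on its own.
OUTPUT @headerDate
    TO @"/Samples/Output/HeaderDate.tsv"
    USING Outputters.Tsv();
```

The resulting @headerDate rowset could then be cross joined with @searchlog in the same way as @searchlogHeaderDate.
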
Answer 1 (score: 2)

I was able to do this with plain U-SQL (i.e. no custom UDO) and a sample file, like this:

USING rx = System.Text.RegularExpressions.Regex;

DECLARE @inputFilepath string = "input/input71.txt";
DECLARE @outputFilepath string = "output/output71.csv";


// Get the first line; use silent option to skip all other lines 
// ie which have more than one column
@file =
    EXTRACT headerLine string
    FROM @inputFilepath
    USING Extractors.Text(delimiter : '|', silent : true);


// Get the start date from the header
@header =
    SELECT
        headerLine,
        rx.Match(headerLine, @"FROM (?<startDate>\d{2}[A-Z]{3}\d{2}) TO (?<endDate>\d{2}[A-Z]{3}\d{2})").Groups["startDate"].ToString() AS startDate
    FROM @file
    WHERE headerLine.Contains("FROM");



// Get the rest of the lines; skip the header row explicitly
// don't use 'silent' as it should not be required (as we're skipping header row)
@body =
    EXTRACT runDate string,
            col1 int,
            col2 int,
            col3 int
    FROM @inputFilepath
    USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);


@result =
    SELECT h.startDate, p.*
    FROM @header AS h
         CROSS JOIN
             @body AS p;


// Export as csv
OUTPUT @result
TO @outputFilepath
USING Outputters.Csv(quoting:false);

My results:

[image: Results]

This is a simple example that demonstrates the power of U-SQL by pairing RegEx with set operations. See whether something similar works for you.
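
The regular expression in this answer already names an endDate group but never reads it. If both dates are wanted as columns, the header step might be extended roughly as follows. This is a sketch under assumptions, not part of the original answer: @headerPattern and the output file name are hypothetical, and the IsMatch filter is my addition so that only lines matching the full pattern are kept. The CROSS JOIN only produces one output row per body row when exactly one header row survives this filter.

```
USING rx = System.Text.RegularExpressions.Regex;

DECLARE @inputFilepath string = "input/input71.txt";
// Hypothetical variable holding the same pattern the answer uses inline.
DECLARE @headerPattern string = @"FROM (?<startDate>\d{2}[A-Z]{3}\d{2}) TO (?<endDate>\d{2}[A-Z]{3}\d{2})";

// Single-column extract; 'silent' drops the multi-column body rows.
@file =
    EXTRACT headerLine string
    FROM @inputFilepath
    USING Extractors.Text(delimiter : '|', silent : true);

// Pull both named groups out of the header line.
@header =
    SELECT rx.Match(headerLine, @headerPattern).Groups["startDate"].Value AS startDate,
           rx.Match(headerLine, @headerPattern).Groups["endDate"].Value AS endDate
    FROM @file
    WHERE rx.IsMatch(headerLine, @headerPattern);

// Hypothetical output path, only here so the script can run on its own.
OUTPUT @header
TO "output/header71.csv"
USING Outputters.Csv(quoting : false);
```

Selecting h.startDate, h.endDate, p.* in the final query would then put both dates in front of every body row.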