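I need to extract records from a text file in U-SQL. The first row differs from the other rows: it contains a date. I have to skip the first row, but I also have to copy the date from it and add it as a new column on every other row. In the final U-SQL output, the first column of each row should therefore hold the same value, copied from the first row of the file. See the attached image file for details.
Please suggest the right U-SQL query to accomplish this.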
Answer 0: (score: 3)
Here is another way. I used SearchLog.tsv from the Samples to demonstrate it. At the top of the file I added the line FROM 01JAN17 TO 31JAN17.
// Skip the first row and read all the other rows
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM @"/Samples/Data/SearchLogWithHeader.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);

// Extract all the text in the same file but don't parse out the individual columns
@searchlogAllText =
    EXTRACT rowText string
    FROM @"/Samples/Data/SearchLogWithHeader.tsv"
    USING Extractors.Text(delimiter: '\n');

// Find a pattern that works for you and use the .NET expressions that match the string
@searchlogHeaderDate =
    SELECT rowText.Split(' ')[1] AS FromDate
    FROM @searchlogAllText
    WHERE rowText.StartsWith("FROM");

// Attach the single header date to every data row
@output =
    SELECT *
    FROM @searchlogHeaderDate
         CROSS JOIN @searchlog;

OUTPUT @output
TO @"/Samples/Output/SearchLog_output.tsv"
USING Outputters.Tsv();
Input:
FROM 01JAN17 TO 31JAN17
399266 2/15/2012 11:53:16 AM en-us how to make nachos 73 www.nachos.com;www.wikipedia.com NULL
382045 2/15/2012 11:53:18 AM en-gb best ski resorts 614 skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm ski-europe.com;www.travelersdigest.com/ski_resorts.htm
382045 2/16/2012 11:53:20 AM en-gb broken leg 74 mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture
106479 2/16/2012 11:53:50 AM en-ca south park episodes 24 southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall southparkstudios.com
906441 2/16/2012 11:54:01 AM en-us cosmos 1213 cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos NULL
351530 2/16/2012 11:54:01 AM en-fr microsoft 241 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
640806 2/16/2012 11:54:02 AM en-us wireless headphones 502 www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com www.amazon.com;store.apple.com
304305 2/16/2012 11:54:03 AM en-us dominos pizza 60 dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos dominos.com
460748 2/16/2012 11:54:04 AM en-us yelp 1270 yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp yelp.com
354841 2/16/2012 11:59:01 AM en-us how to run 610 running.about.com;ehow.com;go.com running.about.com;ehow.com
354068 2/16/2012 12:00:33 PM en-mx what is sql 422 wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL wikipedia.org/wiki/SQL
674364 2/16/2012 12:00:55 PM en-us mexican food redmond 283 eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com NULL
347413 2/16/2012 12:11:55 PM en-gr microsoft 305 microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com NULL
848434 2/16/2012 12:12:35 PM en-ch facebook 10 facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook facebook.com
604846 2/16/2012 12:13:55 PM en-us wikipedia 612 wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia wikipedia.org
840614 2/16/2012 12:13:56 PM en-us xbox 1220 xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360 xbox.com/xbox360
656666 2/16/2012 12:15:55 PM en-us hotmail 691 hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail NULL
951513 2/16/2012 12:17:00 PM en-us pokemon 63 pokemon.com;pokemon.com/us;serebii.net pokemon.com
350350 2/16/2012 12:18:17 PM en-us wolfram 30 wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram NULL
641615 2/16/2012 12:19:55 PM en-us kahn 119 khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name) khanacademy.org
321065 2/16/2012 12:20:03 PM en-us clothes 732 gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars footballfanatics.com/college_washington_state_cougars
651777 2/16/2012 12:20:33 PM en-us food recipes 183 allrecipes.com;foodnetwork.com;simplyrecipes.com foodnetwork.com
666352 2/16/2012 12:21:03 PM en-us weight loss 630 en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com webmd.com/diet
Output:
"01JAN17" 399266 2012-02-15T11:53:16.0000000 "en-us" "how to make nachos" 73 "www.nachos.com;www.wikipedia.com" "NULL"
"01JAN17" 382045 2012-02-15T11:53:18.0000000 "en-gb" "best ski resorts" 614 "skiresorts.com;ski-europe.com;www.travelersdigest.com/ski_resorts.htm" "ski-europe.com;www.travelersdigest.com/ski_resorts.htm"
"01JAN17" 382045 2012-02-16T11:53:20.0000000 "en-gb" "broken leg" 74 "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture" "mayoclinic.com/health;webmd.com/a-to-z-guides;mybrokenleg.com;wikipedia.com/Bone_fracture"
"01JAN17" 106479 2012-02-16T11:53:50.0000000 "en-ca" "south park episodes" 24 "southparkstudios.com;wikipedia.org/wiki/Sout_Park;imdb.com/title/tt0121955;simon.com/mall" "southparkstudios.com"
"01JAN17" 906441 2012-02-16T11:54:01.0000000 "en-us" "cosmos" 1213 "cosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos" "NULL"
"01JAN17" 351530 2012-02-16T11:54:01.0000000 "en-fr" "microsoft" 241 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com" "NULL"
"01JAN17" 640806 2012-02-16T11:54:02.0000000 "en-us" "wireless headphones" 502 "www.amazon.com;reviews.cnet.com/wireless-headphones;store.apple.com" "www.amazon.com;store.apple.com"
"01JAN17" 304305 2012-02-16T11:54:03.0000000 "en-us" "dominos pizza" 60 "dominos.com;wikipedia.org/wiki/Domino's_Pizza;facebook.com/dominos" "dominos.com"
"01JAN17" 460748 2012-02-16T11:54:04.0000000 "en-us" "yelp" 1270 "yelp.com;apple.com/us/app/yelp;wikipedia.org/wiki/Yelp,_Inc.;facebook.com/yelp" "yelp.com"
"01JAN17" 354841 2012-02-16T11:59:01.0000000 "en-us" "how to run" 610 "running.about.com;ehow.com;go.com" "running.about.com;ehow.com"
"01JAN17" 354068 2012-02-16T12:00:33.0000000 "en-mx" "what is sql" 422 "wikipedia.org/wiki/SQL;sqlcourse.com/intro.html;wikipedia.org/wiki/Microsoft_SQL" "wikipedia.org/wiki/SQL"
"01JAN17" 674364 2012-02-16T12:00:55.0000000 "en-us" "mexican food redmond" 283 "eltoreador.com;yelp.com/c/redmond-wa/mexican;agaverest.com" "NULL"
"01JAN17" 347413 2012-02-16T12:11:55.0000000 "en-gr" "microsoft" 305 "microsoft.com;wikipedia.org/wiki/Microsoft;xbox.com" "NULL"
"01JAN17" 848434 2012-02-16T12:12:35.0000000 "en-ch" "facebook" 10 "facebook.com;facebook.com/login;wikipedia.org/wiki/Facebook" "facebook.com"
"01JAN17" 604846 2012-02-16T12:13:55.0000000 "en-us" "wikipedia" 612 "wikipedia.org;en.wikipedia.org;en.wikipedia.org/wiki/Wikipedia" "wikipedia.org"
"01JAN17" 840614 2012-02-16T12:13:56.0000000 "en-us" "xbox" 1220 "xbox.com;en.wikipedia.org/wiki/Xbox;xbox.com/xbox360" "xbox.com/xbox360"
"01JAN17" 656666 2012-02-16T12:15:55.0000000 "en-us" "hotmail" 691 "hotmail.com;login.live.com;msn.com;en.wikipedia.org/wiki/Hotmail" "NULL"
"01JAN17" 951513 2012-02-16T12:17:00.0000000 "en-us" "pokemon" 63 "pokemon.com;pokemon.com/us;serebii.net" "pokemon.com"
"01JAN17" 350350 2012-02-16T12:18:17.0000000 "en-us" "wolfram" 30 "wolframalpha.com;wolfram.com;mathworld.wolfram.com;en.wikipedia.org/wiki/Stephen_Wolfram" "NULL"
"01JAN17" 641615 2012-02-16T12:19:55.0000000 "en-us" "kahn" 119 "khanacademy.org;en.wikipedia.org/wiki/Khan_(title);answers.com/topic/genghis-khan;en.wikipedia.org/wiki/Khan_(name)" "khanacademy.org"
"01JAN17" 321065 2012-02-16T12:20:03.0000000 "en-us" "clothes" 732 "gap.com;overstock.com;forever21.com;footballfanatics.com/college_washington_state_cougars" "footballfanatics.com/college_washington_state_cougars"
"01JAN17" 651777 2012-02-16T12:20:33.0000000 "en-us" "food recipes" 183 "allrecipes.com;foodnetwork.com;simplyrecipes.com" "foodnetwork.com"
"01JAN17" 666352 2012-02-16T12:21:03.0000000 "en-us" "weight loss" 630 "en.wikipedia.org/wiki/Weight_loss;webmd.com/diet;exercise.about.com" "webmd.com/diet"
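To sanity-check the logic of the script above without submitting a job, the same three steps (pick the date out of the header, skip the header, attach the date to every row) can be mimicked in a few lines of Python. This is only an illustrative sketch, not U-SQL: the function name `prepend_header_date` is made up, and it assumes the data rows are tab-separated and that the header is the only line starting with FROM.

```python
def prepend_header_date(lines):
    """Mimic the U-SQL script: header date + skipped header + cross join."""
    # Counterpart of: WHERE rowText.StartsWith("FROM") and rowText.Split(' ')[1]
    header_dates = [line.split(" ")[1] for line in lines if line.startswith("FROM")]
    # Counterpart of: Extractors.Tsv(skipFirstNRows: 1)
    data_rows = [line.split("\t") for line in lines[1:]]
    # Counterpart of: CROSS JOIN (every header date paired with every data row)
    return [[date] + row for date in header_dates for row in data_rows]


# Two rows taken from the sample input; tab separators are an assumption
sample = [
    "FROM 01JAN17 TO 31JAN17",
    "399266\t2/15/2012 11:53:16 AM\ten-us\thow to make nachos\t73\twww.nachos.com;www.wikipedia.com\tNULL",
    "906441\t2/16/2012 11:54:01 AM\ten-us\tcosmos\t1213\tcosmos.com;wikipedia.org/wiki/Cosmos:_A_Personal_Voyage;hulu.com/cosmos\tNULL",
]

rows = prepend_header_date(sample)
for row in rows:
    print("\t".join(row))
```

Note what the cross join implies: if no line starts with FROM the result is empty, and if more than one line does, every data row is repeated once per match. The approach relies on the header being the only matching line.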
Answer 1: (score: 2)
I was able to do something like this with plain U-SQL (i.e. no custom UDO) and the sample file:
USING rx = System.Text.RegularExpressions.Regex;

DECLARE @inputFilepath string = "input/input71.txt";
DECLARE @outputFilepath string = "output/output71.csv";

// Get the first line; use the silent option to skip all other lines,
// i.e. those which have more than one column
@file =
    EXTRACT headerLine string
    FROM @inputFilepath
    USING Extractors.Text(delimiter : '|', silent : true);

// Get the start date from the header
@header =
    SELECT headerLine,
           rx.Match(headerLine, @"FROM (?<startDate>\d{2}[A-Z]{3}\d{2}) TO (?<endDate>\d{2}[A-Z]{3}\d{2})").Groups["startDate"].ToString() AS startDate
    FROM @file
    WHERE headerLine.Contains("FROM");

// Get the rest of the lines; skip the header row explicitly.
// Don't use 'silent' here as it should not be required (we're skipping the header row)
@body =
    EXTRACT runDate string,
            col1 int,
            col2 int,
            col3 int
    FROM @inputFilepath
    USING Extractors.Text(delimiter : '|', skipFirstNRows : 1);

@result =
    SELECT h.startDate, p.*
    FROM @header AS h
         CROSS JOIN
         @body AS p;

// Export as csv
OUTPUT @result
TO @outputFilepath
USING Outputters.Csv(quoting:false);
My results:
This is a simple example that demonstrates the power of U-SQL by pairing RegEx with set operations. See if something similar works for you.
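If you want to try the header pattern before running a job, a rough Python equivalent is below. It is a sketch under assumptions, not part of the U-SQL answer: the helper names and the sample body line are hypothetical, Python spells named groups `(?P<name>...)` where .NET uses `(?<name>...)`, and the `single_column_lines` filter is only my approximation of what `silent : true` does with a one-column schema (skipping rows whose column count does not match).

```python
import re

# Same pattern as in the U-SQL script, with Python's named-group syntax
HEADER_PATTERN = re.compile(
    r"FROM (?P<startDate>\d{2}[A-Z]{3}\d{2}) TO (?P<endDate>\d{2}[A-Z]{3}\d{2})"
)


def extract_start_date(header_line):
    """Return the start date from the header, or "" when the pattern does not match."""
    match = HEADER_PATTERN.search(header_line)
    return match.group("startDate") if match else ""


def single_column_lines(lines, delimiter="|"):
    """Approximate 'silent : true' with a one-column schema: keep one-column lines only."""
    return [line for line in lines if len(line.split(delimiter)) == 1]


# Hypothetical file content: one header line and one pipe-delimited body line
lines = ["FROM 01JAN17 TO 31JAN17", "2017-01-01|1|2|3"]

header_lines = single_column_lines(lines)
start_dates = [extract_start_date(line) for line in header_lines if "FROM" in line]
print(start_dates)
```

Returning an empty string on a failed match mirrors how the .NET expression behaves when the group is not found, so a malformed header yields an empty date column rather than an error.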