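U-SQL script to search for a string, then GROUP BY that string and get the count of distinct files

Time: 2016-11-26 03:54:17

Tags: u-sql

I am very new to U-SQL and am trying to solve the following: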

str1 = \global\europe\Moscow\12345\File1.txt

str2 = \global.bee.com\europe\Moscow\12345\File1.txt

str3 = \global\europe\Amsterdam\54321\File1.Rvt

str4 = \global.bee.com\europe\Amsterdam\12345\File1.Rvt

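Case 1: How do I get "\europe\Moscow\12345\File1.txt" from the string variables str1 and str2? I want to extract "\europe\Moscow\12345\File1.txt" from str1 and str2, then group by "\global\europe\Moscow\12345" and get the number of distinct files under the path "\europe\Moscow\12345\".

So the output would look like this:

distinct_filesby_Location_Date

To solve the case above I tried the U-SQL code below, but I am not sure whether I am writing the right script: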

@inArray = SELECT new SQL.ARRAY<string>(
                filepath.Contains("\\europe")) AS path
    FROM @t;

@filesbyloc =
    SELECT [ID],
        path.Trim() AS path1
    FROM @inArray
    CROSS APPLY
    EXPLODE(path1) AS r(location);

OUTPUT @filesbyloc
TO "/Outputs/distinctfilesbylocation.tsv"
USING Outputters.Tsv();

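Any help would be greatly appreciated.

2 Answers:

Answer 0: (score: 1)

One approach is to put all the strings you want to work with in a file, e.g. cut, and save it in your U-SQL input folder. Also have a file containing the cities you want to match, e.g. cities.txt. Then try the following U-SQL script: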

[script not preserved in the source]

My results:

My output file results

HTH
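
As a quick way to check the logic outside a U-SQL job, here is a minimal Python model of what Case 1 in the question asks for. It is not the U-SQL script from this answer. The helper names (`relative_from`, `distinct_files_by_location`), the `\europe` marker, and the choice to compare file names case-insensitively are assumptions made for illustration.

```python
from collections import defaultdict

# Sample paths from the question.
paths = [
    r"\global\europe\Moscow\12345\File1.txt",
    r"\global.bee.com\europe\Moscow\12345\File1.txt",
    r"\global\europe\Amsterdam\54321\File1.Rvt",
    r"\global.bee.com\europe\Amsterdam\12345\File1.Rvt",
]


def relative_from(path, marker="\\europe"):
    """Return the part of `path` that starts at `marker` (case-insensitive)."""
    idx = path.lower().find(marker.lower())
    if idx < 0:
        raise ValueError(f"{marker!r} not found in {path!r}")
    return path[idx:]


def distinct_files_by_location(all_paths):
    """Group by the folder below the marker and count distinct file names."""
    files = defaultdict(set)
    for p in all_paths:
        # Split "\europe\Moscow\12345\File1.txt" into folder and file name.
        folder, _, name = relative_from(p).rpartition("\\")
        # Assumption: file names are compared case-insensitively.
        files[folder].add(name.lower())
    return {folder: len(names) for folder, names in files.items()}


counts = distinct_files_by_location(paths)
for folder, n in counts.items():
    print(folder, n)
```

Here str1 and str2 collapse into the same location, `\europe\Moscow\12345`, with a single distinct file, because the differing host prefix is cut off before grouping.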

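Answer 1: (score: 1)

Taking the liberty of formatting your input file as a TSV file, and without knowing all the column semantics, here is one way to write your query. Note that I made the assumptions stated in the comments.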

@d =
    EXTRACT path string,
            user string,
            num1 int,
            num2 int,
            start_date string,
            end_date string,
            flag string,
            year int,
            s string,
            another_date string
    FROM @"\users\temp\citypaths.txt"
    USING Extractors.Tsv(encoding: Encoding.Unicode);

// I assume that you have only one DateTime format culture in your file. 
// If it becomes dependent on the region or city as expressed in the path, you need to add a lookup.
@d =
SELECT new SqlArray<string>(path.Split('\\')) AS steps,
       DateTime.Parse(end_date, new CultureInfo("fr-FR", false)).Date.ToString("yyyy-MM-dd") AS end_date
FROM @d;

// This assumes your paths have a fixed formatting/mapping into the city
@d =
SELECT steps[4].ToLowerInvariant() AS city,
       end_date
FROM @d;

@res =
SELECT city,
       end_date,
       COUNT( * ) AS count
FROM @d
GROUP BY city,
         end_date;

OUTPUT @res
TO "/output/result.csv"
USING Outputters.Csv();

// Now let's pivot the date and count.
@res2 =
SELECT city,
       MAP_AGG(end_date, count) AS date_count
FROM @res
GROUP BY city;

// This assumes you know exactly which dates you are looking for. Otherwise keep it in the first file representation.
@res2 =
SELECT city,
       date_count["2016-11-21"] AS [2016-11-21],
       date_count["2016-11-22"] AS [2016-11-22]
FROM @res2;

OUTPUT @res2
TO "/output/res2.csv"
USING Outputters.Csv();
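
The `steps[4]` lookup depends on how the path splits, so a small Python model of the split, group, and count steps may help when checking it against real data. This is only a sketch: the sample rows are hypothetical, the paths are assumed to start with two backslashes (UNC style), and the dates are assumed to be day/month/year text, which is what the fr-FR culture in the script suggests.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: (path, end_date as day/month/year text).
rows = [
    (r"\\global.bee.com\europe\Moscow\12345\File1.txt", "21/11/2016"),
    (r"\\global.bee.com\europe\Moscow\12345\File2.txt", "21/11/2016"),
    (r"\\global.bee.com\europe\Amsterdam\54321\File1.rvt", "22/11/2016"),
]


def city_of(path):
    # Splitting on "\" keeps empty entries, so two leading backslashes give
    # ["", "", host, region, city, ...] and the city sits at index 4.
    steps = path.split("\\")
    return steps[4].lower()


def iso_date(text):
    # Assumed input format: day/month/year.
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


# Equivalent of GROUP BY city, end_date with COUNT(*).
counts = Counter((city_of(p), iso_date(d)) for p, d in rows)
for (city, day), n in sorted(counts.items()):
    print(city, day, n)
```

If the paths start with a single backslash, as written in the question, the city moves to index 3, so the index is worth checking against a few real rows first.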

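Update after receiving some sample data in a private email:

Based on the data you sent me, you want to pivot the rowset city, count, date into the rowset date, city1, city2, ..., where each row contains the date and the count for each city. (To extract and count the cities, you can either use a join as described in Bob's answer, which requires knowing your cities in advance, or take the string at the city's position in the path as in my example, which does not.)

You can easily adapt the example above by changing the calculation of @res2 as follows: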

// Now let's pivot the city and count.
@res2 = SELECT end_date, MAP_AGG(city, count) AS city_count 
        FROM @res 
        GROUP BY end_date;

// This assumes you know exactly which cities you are looking for. Otherwise keep it in the first file representation or use script generation (see below).
@res2 =
SELECT end_date,
       city_count["istanbul"] AS istanbul,
       city_count["midlands"] AS midlands,
       city_count["belfast"] AS belfast,
       city_count["acoustics"] AS acoustics,
       city_count["amsterdam"] AS amsterdam
FROM @res2;
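
To make the two-step pivot concrete, here is a minimal Python model of it: first collect one city-to-count map per date (the role MAP_AGG plays above), then read one column per known city out of that map. The sample rows and the `pivot_by_date` helper are hypothetical, and how this sketch fills a missing city (with `None`) is a choice made here, not a statement about U-SQL.

```python
# Hypothetical (city, end_date, count) rows standing in for @res.
res = [
    ("istanbul", "2016-11-21", 3),
    ("belfast", "2016-11-21", 1),
    ("istanbul", "2016-11-22", 5),
]


def pivot_by_date(rows, cities):
    # Step 1: one map per date, like MAP_AGG(city, count) ... GROUP BY end_date.
    city_count = {}
    for city, end_date, count in rows:
        city_count.setdefault(end_date, {})[city] = count
    # Step 2: one column per listed city; a city without a row gets None.
    return [
        {"end_date": day, **{c: per_city.get(c) for c in cities}}
        for day, per_city in sorted(city_count.items())
    ]


table = pivot_by_date(res, ["istanbul", "belfast", "amsterdam"])
for row in table:
    print(row)
```

The list of cities has to be supplied by the caller, which is the same limitation the U-SQL version has: the output columns must be known when the statement is written.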

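Note that in my example you need to enumerate all the cities in the pivot statement by looking them up in the SQL.MAP column. If the cities are not known a priori, you first have to submit a script that generates the processing script for you. For example, assuming your city, count, date rowset is in a file (or you could duplicate the statements that produce the rowset in both the generating script and the generated script), you can write the generator as the following script. Then take its result and submit that as the actual processing script.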

// Get the rowset (could also be the actual calculation from the original file)
@in = EXTRACT  city string, count int?, date string
      FROM "/users/temp/Revit_Last2Months_Results.tsv"
      USING Extractors.Tsv();

// Generate the statements for the preparation of the data before the pivot 
@stmts = SELECT * FROM (VALUES
                  ( "@s1", "EXTRACT  city string, count int?, date string FROM \"/users/temp/Revit_Last2Months_Results.tsv\" USING Extractors.Tsv();"),
                  ( "@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;" )
                  ) AS  T( stmt_name, stmt);

// Now generate the statement doing the pivot
@cities = SELECT DISTINCT city FROM @in;

@pivots = 
SELECT "@s3" AS stmt_name, "SELECT date, "+String.Join(", ", ARRAY_AGG("city_count[\""+city+"\"] AS ["+city+"]"))+ " FROM @s2;" AS stmt 
FROM @cities;

// Now generate the OUTPUT statement after the pivot. Note that the OUTPUT does not have a statement name.
@output = 
SELECT "OUTPUT @s3 TO \"/output/pivot_gen.tsv\" USING Outputters.Tsv();" AS stmt 
FROM (VALUES(1)) AS T(x);

// Now put the statements into one rowset. Note that nulls sort high in U-SQL, so the OUTPUT row (null stmt_name) ends up last.
@result = 
SELECT stmt_name, "=" AS assign, stmt FROM @stmts
UNION ALL SELECT stmt_name, "=" AS assign, stmt FROM @pivots
UNION ALL SELECT (string) null AS stmt_name, (string) null AS assign, stmt FROM @output;

// Now output the statements in order of the stmt_name
OUTPUT @result 
TO "/pivot.usql" 
ORDER BY stmt_name 
USING Outputters.Text(delimiter:' ', quoting:false);

Now download the file and submit it.
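
To see what the generated file should roughly contain before submitting it, the string-building part of the generator can be modeled in a few lines of Python. This is a sketch of the same idea, not a replacement for the U-SQL generator: `pivot_select` and `assemble` are hypothetical helpers, and the city list is made up.

```python
def pivot_select(cities, source="@s2"):
    """Build the pivot SELECT with one column per city."""
    columns = ", ".join(f'city_count["{c}"] AS [{c}]' for c in cities)
    return f"SELECT date, {columns} FROM {source};"


def assemble(statements):
    """Order (name, text) pairs by name; unnamed statements go last."""
    ordered = sorted(statements, key=lambda s: (s[0] is None, s[0] or ""))
    return "\n".join(f"{name} = {text}" if name else text for name, text in ordered)


script = assemble([
    (None, 'OUTPUT @s3 TO "/output/pivot_gen.tsv" USING Outputters.Tsv();'),
    ("@s3", pivot_select(["belfast", "istanbul"])),
    ("@s2", "SELECT date, MAP_AGG(city, count) AS city_count FROM @s1 GROUP BY date;"),
    ("@s1", 'EXTRACT city string, count int?, date string '
            'FROM "/users/temp/Revit_Last2Months_Results.tsv" USING Extractors.Tsv();'),
])
print(script)
```

City names are pasted into the statement text unescaped, so a name containing `"` or `]` would produce an invalid statement in this sketch. It is worth checking the distinct city values before relying on generated code.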