Merging two CSV files in Azure Data Factory using a custom .NET activity

Time: 2017-01-26 11:24:13

Tags: azure transformation azure-data-factory custom-activity

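I have two CSV files, each with n columns. I need to merge them into a single CSV file; the two input files share one unique column.

I have searched blogs and sites thoroughly, and all of them point to using a custom .NET activity. So I went through this site.

But I still can't work out the C# coding part. Can anyone share code showing how to merge these two CSV files using a custom .NET activity in Azure Data Factory?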

1 answer:

Answer 0: (score: 1)

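Here is an example of how to join two tab-delimited files on the ZIP_Code column using U-SQL. The example assumes both files are held in Azure Data Lake Storage (ADLS). The script could easily be incorporated into a Data Factory pipeline: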

// Get raw input from file A
@inputA =
    EXTRACT 
        Date_received   string,
        Product string,
        Sub_product string,
        Issue   string,
        Sub_issue   string,
        Consumer_complaint_narrative    string,
        Company_public_response string,
        Company string,
        State   string,
        ZIP_Code    string,
        Tags    string,
        Consumer_consent_provided   string,
        Submitted_via   string,
        Date_sent_to_company    string,
        Company_response_to_consumer    string,
        Timely_response string,
        Consumer_disputed   string,
        Complaint_ID    string

    FROM "/input/input48A.txt"
    USING Extractors.Tsv();


// Get raw input from file B
@inputB =
    EXTRACT Provider_ID string,
            Hospital_Name string,
            Address string,
            City string,
            State string,
            ZIP_Code string,
            County_Name string,
            Phone_Number string,
            Hospital_Type string,
            Hospital_Ownership string,
            Emergency_Services string,
            Meets_criteria_for_meaningful_use_of_EHRs string,
            Hospital_overall_rating string,
            Hospital_overall_rating_footnote string,
            Mortality_national_comparison string,
            Mortality_national_comparison_footnote string,
            Safety_of_care_national_comparison string,
            Safety_of_care_national_comparison_footnote string,
            Readmission_national_comparison string,
            Readmission_national_comparison_footnote string,
            Patient_experience_national_comparison string,
            Patient_experience_national_comparison_footnote string,
            Effectiveness_of_care_national_comparison string,
            Effectiveness_of_care_national_comparison_footnote string,
            Timeliness_of_care_national_comparison string,
            Timeliness_of_care_national_comparison_footnote string,
            Efficient_use_of_medical_imaging_national_comparison string,
            Efficient_use_of_medical_imaging_national_comparison_footnote string,
            Location string

    FROM "/input/input48B.txt"
    USING Extractors.Tsv();


// Inner-join the two files on the ZIP_Code column, keeping only rows for one zip code
@output =
    SELECT b.Provider_ID,
           b.Hospital_Name,
           b.Address,
           b.City,
           b.State,
           b.ZIP_Code,
           a.Complaint_ID

    FROM @inputA AS a
         INNER JOIN
             @inputB AS b
         ON a.ZIP_Code == b.ZIP_Code
    WHERE a.ZIP_Code == "36033";


// Output the file
OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);

This could also be converted into a U-SQL stored procedure with parameters for the file names and the zip code.

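There are of course many ways to achieve this, each with its own pros and cons. A custom .NET activity, for example, might feel more comfortable for someone with a .NET background, but you will need some compute to run it on. Importing the files into an Azure SQL Database would be a good option for someone with a SQL / database background and an Azure SQL DB in the subscription.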
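
Whichever option you pick, the core logic is the same inner join on the shared key column. As a rough, language-neutral illustration of what a custom activity would have to do, here is a minimal Python sketch. It is not the U-SQL script above and not Data Factory code. The function name inner_join_tsv, the assumption that both files have a header row, and the sample rows are all made up for the example.

```python
import csv
import io
from collections import defaultdict


def inner_join_tsv(file_a, file_b, key, columns_a, columns_b):
    """Inner-join two tab-separated files (assumed to have header rows) on `key`.

    Returns a list of dicts holding `columns_b` from file B plus `columns_a`
    from file A, one dict per matching pair of rows.
    """
    # Index file A by the join key so each row of B can be matched quickly.
    rows_by_key = defaultdict(list)
    for row in csv.DictReader(file_a, delimiter="\t"):
        rows_by_key[row[key]].append(row)

    joined = []
    for row_b in csv.DictReader(file_b, delimiter="\t"):
        # Rows of B without a matching key in A are dropped (inner join).
        for row_a in rows_by_key.get(row_b[key], []):
            out = {c: row_b[c] for c in columns_b}
            out.update({c: row_a[c] for c in columns_a})
            joined.append(out)
    return joined


# Tiny in-memory stand-ins for the two input files (made-up sample rows).
sample_a = io.StringIO("Complaint_ID\tZIP_Code\n1001\t36033\n1002\t99999\n")
sample_b = io.StringIO(
    "Provider_ID\tHospital_Name\tZIP_Code\nP1\tExample Hospital\t36033\n"
)

result = inner_join_tsv(
    sample_a,
    sample_b,
    key="ZIP_Code",
    columns_a=["Complaint_ID"],
    columns_b=["Provider_ID", "Hospital_Name", "ZIP_Code"],
)
```

Indexing one file by key first keeps the join to a single pass over each file, but it holds that file in memory. For large inputs, a distributed engine such as the U-SQL job above is likely the better fit.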