Multi-level JSON from CSV using Pig and JsonStorage

Time: 2017-11-06 19:06:12

Tags: apache-pig

I have a CSV file in the following format:

customerid, period, credit, debit
 100, jan-2017, 500, 300
 100, jan-2017, 300,0
 100, feb-2017, 200,100
 100, mar-2017, 200,10
 200, jan-2017, 100, 200
 200, feb-2017,100,200

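My requirement is to first group the rows by customer ID, then group each customer's rows by period and merge the transactions, and finally use an Apache Pig script to produce hierarchical JSON like the following.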

    {
    {
        "customerid": 100,
        "periods": [{
            "period": "jan-2017",
            "transactions": [{"credit": 500,"debit": 300},....]
        }, {
            "period": "feb-2017",
            "transactions": [...]
        }, {
            "period": "mar-2017",
            "transactions": [....]
        }]
    }, {
        "customerid": 200,
        "periods": [{
            "period": "jan-2017",
            "transactions": [.....]
        }, {
            "period": "feb-2017",
            "transactions": [.....]
        }]
    }
    }

I am very new to Pig, but I managed to write the following script:

Data = LOAD 'data.csv' USING PigStorage(',') AS (
    company_id:chararray,
    period:chararray,
    debit:chararray,
    credit:chararray)

CompanyBag = GROUP Data BY (company_id);

final_trsnactionjson = FOREACH CompanyBag {
    ByCompanyId = FOREACH Data {
        PeriodBag = GROUP Data BY (period);

        IdPeriodItemRoot = FOREACH PeriodBag{
            ItemRecords = FOREACH Source GENERATE debit as debit, credit as credit
            GENERATE group as period, TOTUPLE(ItemRecords) as transactions;
        }   
    }
    GENERATE group as customerid, TOTUPLE(PeriodBag) AS periods;
};

But this gives me the following error:

mismatched input '{' expecting GENERATE

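I have searched a lot for how to generate nested JSON with Pig, but could not find any good pointers. Where am I going wrong? Thanks in advance for your help.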

1 answer:

Answer 0: (score: 0)

  1. Use the JsonLoader that ships with Pig (see https://pig.apache.org/docs/r0.11.1/func.html#jsonloadstore). You can provide a nested schema in the "AS" clause.
  2. com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') is simpler to use for handling any nested JSON arrays.
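
As for the error in the question itself, a likely cause is that Pig's nested FOREACH block accepts only a small set of operators (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, ORDER BY). GROUP is not one of them, and an inner FOREACH cannot open its own { ... } block, which would explain "expecting GENERATE". A minimal sketch of one workaround is to do both groupings at the top level, first by (customerid, period) and then by customerid, and store the result with the built-in JsonStorage. The aliases (Raw, Clean, ByPeriod, Periods, ByCustomer, Result) and the 'output' path are illustrative assumptions, and the script assumes data.csv still contains the header row and stray spaces shown in the question.

```pig
-- Load every column as text first, because the sample file has a header
-- row and inconsistent spaces around the commas.
Raw = LOAD 'data.csv' USING PigStorage(',') AS (
    customerid:chararray,
    period:chararray,
    credit:chararray,
    debit:chararray);

-- Trim whitespace and cast the numeric columns.
Typed = FOREACH Raw GENERATE
    (int)TRIM(customerid) AS customerid,
    TRIM(period)          AS period,
    (int)TRIM(credit)     AS credit,
    (int)TRIM(debit)      AS debit;

-- The header row cannot be cast to int, so its customerid becomes null.
Clean = FILTER Typed BY customerid IS NOT NULL;

-- First level: one row per (customerid, period) with a bag of transactions.
ByPeriod = GROUP Clean BY (customerid, period);
Periods = FOREACH ByPeriod GENERATE
    group.customerid      AS customerid,
    group.period          AS period,
    Clean.(credit, debit) AS transactions;

-- Second level: one row per customer with a bag of periods.
ByCustomer = GROUP Periods BY customerid;
Result = FOREACH ByCustomer GENERATE
    group                          AS customerid,
    Periods.(period, transactions) AS periods;

-- 'output' is a hypothetical target directory.
STORE Result INTO 'output' USING JsonStorage();
```

Note that JsonStorage writes one JSON object per output record, so this should yield one line per customer rather than a single enclosing document.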