Pig - outputSchema - creating a schema for a tuple

Time: 2016-01-12 11:05:07

Tags: java hadoop apache-pig cloudera

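I am trying to define an output schema that is a tuple containing two other tuples, i.e. stats:tuple(c:tuple(),d:tuple())

The code below does not work as expected. It somehow produces this structure: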

stats:tuple(b:tuple(c:tuple(),d:tuple()))

Here is the output produced by describe:

sourceData: {com.mortardata.pig.dataspliter_36: (stats: ((name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),(name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray)))}

Is it possible to create the structure below instead? That would mean removing tuple b from the previous example.

grunt> describe sourceData;
sourceData: {t: (s: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),n: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray))}

This is the code that does not work as expected:

    public Schema outputSchema(Schema input) {
        // First inner tuple: six chararray fields
        Schema sensTuple = new Schema();
        sensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
        sensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
        sensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
        sensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
        sensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
        sensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));

        // Second inner tuple: the same six fields
        Schema nonSensTuple = new Schema();
        nonSensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
        nonSensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
        nonSensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
        nonSensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
        nonSensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
        nonSensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));

        // Parent schema holding the two inner tuples (both without an alias)
        Schema parentTuple = new Schema();
        parentTuple.add(new Schema.FieldSchema(null, sensTuple, DataType.TUPLE));
        parentTuple.add(new Schema.FieldSchema(null, nonSensTuple, DataType.TUPLE));

        // Wraps the parent in a tuple field named "stats"
        Schema outputSchema = new Schema();
        outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));

        // Wraps "stats" once more in a tuple named after the UDF class
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
                outputSchema, DataType.TUPLE));
    }
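
One way to see where the extra level might come from is to count the tuple wrappers. The sketch below is plain Java and deliberately does not use the Pig classes: `Field`, `tuple`, `record`, `describe`, and `depth` are hypothetical stand-ins, and the alias `udf` stands in for whatever `getSchemaName` returns. It models the schema built by the code above next to the structure being asked for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SchemaNestingSketch {

    /** Hypothetical stand-in for a schema field: an alias plus, for tuples, child fields. */
    static final class Field {
        final String alias;          // may be null, like the unnamed inner tuples in the question
        final String type;           // "chararray" for leaves, "tuple" for nested fields
        final List<Field> children;  // empty for leaves

        Field(String alias, String type, List<Field> children) {
            this.alias = alias;
            this.type = type;
            this.children = children;
        }
    }

    static Field leaf(String alias) {
        return new Field(alias, "chararray", new ArrayList<Field>());
    }

    static Field tuple(String alias, Field... children) {
        return new Field(alias, "tuple", Arrays.asList(children));
    }

    /** The six chararray fields shared by both inner tuples. */
    static Field record(String alias) {
        return tuple(alias, leaf("name"), leaf("customerId"), leaf("VIN"),
                leaf("birth_date"), leaf("fuel_mileage"), leaf("fuel_consumption"));
    }

    /** Renders a field roughly in the style of the describe output. */
    static String describe(Field f) {
        StringBuilder sb = new StringBuilder();
        if (f.alias != null) {
            sb.append(f.alias).append(": ");
        }
        if (f.children.isEmpty()) {
            return sb.append(f.type).toString();
        }
        sb.append('(');
        for (int i = 0; i < f.children.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(describe(f.children.get(i)));
        }
        return sb.append(')').toString();
    }

    /** Counts tuple levels from this field down to the leaves. */
    static int depth(Field f) {
        if (!"tuple".equals(f.type)) {
            return 0;
        }
        int deepest = 0;
        for (Field child : f.children) {
            deepest = Math.max(deepest, depth(child));
        }
        return 1 + deepest;
    }

    /** Shape built by the posted code: UDF-named tuple -> "stats" tuple -> two unnamed tuples. */
    static Field posted() {
        return tuple("udf", tuple("stats", record(null), record(null)));
    }

    /** Shape asked for: one outer tuple that directly holds the two named tuples. */
    static Field desired() {
        return tuple("t", record("s"), record("n"));
    }

    public static void main(String[] args) {
        System.out.println(depth(posted()) + " levels: " + describe(posted()));
        System.out.println(depth(desired()) + " levels: " + describe(desired()));
    }
}
```

If this model matches what Pig does, the intermediate `outputSchema` that holds `stats` is the extra level. Passing `parentTuple` directly to the returned `Schema.FieldSchema`, and giving the two inner fields aliases such as `s` and `n` instead of `null`, would be the corresponding change in the UDF. This has not been run against Pig, so treat it as an assumption to check with describe.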

The UDF's exec method returns:

    public Tuple exec(Tuple tuple) throws IOException {
        // tuple1 and tuple2 are not defined in this excerpt
        Tuple parentTuple = mTupleFactory.newTuple();

        parentTuple.append(tuple1);
        parentTuple.append(tuple2);

        return parentTuple;
    }

Thanks.

0 answers:

No answers yet.