Pig casting / data types

Date: 2016-01-11 17:04:33

Tags: java hadoop apache-pig cloudera avro

I'm trying to store a relation into an Avro file, but I'm getting a strange error:

org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence

I'm not using DataByteArray (bytearray) anywhere; see the description of the relation below.

sensitiveSet: {rank_ID: long,name: chararray,customerId: long,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray}

Even when I cast the fields explicitly, I get the same error:

sensitiveSet = foreach sensitiveSet generate (long) $0, (chararray) $1, (long) $2, (chararray) $3, (chararray) $4, (chararray) $5, (chararray) $6;

STORE sensitiveSet INTO 'testOut2222.avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage('no_schema_check', 'schema', '{"type":"record","name":"xxxx","namespace":"","fields":[{"name":"rank_ID","type":"long"},{"name":"name","type":"string","store":"no","sensitive":"na"},{"name":"customerId","type":"string","store":"yes","sensitive":"yes"},{"name":"VIN","type":"string","store":"yes","sensitive":"yes"},{"name":"birth_date","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_mileage","type":"string","store":"yes","sensitive":"no"},{"name":"fuel_consumption","type":"string","store":"yes","sensitive":"no"}]}');

EDIT:

I'm trying to define an output schema that should be a tuple containing two other tuples, i.e. stats:tuple(c:tuple(),d:tuple()).

The code below does not work as expected. It somehow produces the structure:

stats:tuple(b:tuple(c:tuple(),d:tuple()))

Here is the output produced by DESCRIBE:

sourceData: {com.mortardata.pig.dataspliter_36: (stats: ((name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),(name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray)))}

Is it possible to create a structure like the following? That is, I need to get rid of tuple b from the previous example:

grunt> describe sourceData;
sourceData: {t: (s: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray),n: (name: chararray,customerId: chararray,VIN: chararray,birth_date: chararray,fuel_mileage: chararray,fuel_consumption: chararray))}

The following code does not work as expected:

  public Schema outputSchema(Schema input) {
    try {
      // Inner tuple describing the sensitive fields.
      Schema sensTuple = new Schema();
      sensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
      sensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
      sensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
      sensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
      sensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
      sensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));

      // Inner tuple describing the non-sensitive fields.
      Schema nonSensTuple = new Schema();
      nonSensTuple.add(new Schema.FieldSchema("name", DataType.CHARARRAY));
      nonSensTuple.add(new Schema.FieldSchema("customerId", DataType.CHARARRAY));
      nonSensTuple.add(new Schema.FieldSchema("VIN", DataType.CHARARRAY));
      nonSensTuple.add(new Schema.FieldSchema("birth_date", DataType.CHARARRAY));
      nonSensTuple.add(new Schema.FieldSchema("fuel_mileage", DataType.CHARARRAY));
      nonSensTuple.add(new Schema.FieldSchema("fuel_consumption", DataType.CHARARRAY));

      // Parent tuple wrapping the two inner tuples.
      Schema parentTuple = new Schema();
      parentTuple.add(new Schema.FieldSchema(null, sensTuple, DataType.TUPLE));
      parentTuple.add(new Schema.FieldSchema(null, nonSensTuple, DataType.TUPLE));

      Schema outputSchema = new Schema();
      outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));

      // Wrapping outputSchema in yet another tuple is what introduces the
      // extra level of nesting (tuple b) seen in the DESCRIBE output above.
      return new Schema(new Schema.FieldSchema(
          getSchemaName(this.getClass().getName().toLowerCase(), input),
          outputSchema, DataType.TUPLE));
    } catch (FrontendException e) {
      throw new RuntimeException(e);
    }
  }

The UDF's exec() method returns:

public Tuple exec(Tuple tuple) throws IOException {
  // tuple1 and tuple2 (construction not shown in the post) hold the
  // sensitive and non-sensitive fields, respectively.
  Tuple parentTuple = mTupleFactory.newTuple();
  parentTuple.append(tuple1);
  parentTuple.append(tuple2);
  return parentTuple;
}

EDIT 2 (fixed):

...
Schema outputSchema = new Schema();
outputSchema.add(new Schema.FieldSchema("stats", parentTuple, DataType.TUPLE));
// removed: return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), outputSchema, DataType.TUPLE));
return outputSchema;

Now the UDF returns the correct schema, where all items are chararray, but when I try to store those items into an Avro file as type string, I get the same error:

java.lang.Exception: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.CharSequence
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)

SOLVED: OK, the problem was that the data was not being converted to the proper types inside the UDF body, i.e. in the exec() method. It seems to work now!
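For the record, a minimal sketch of the kind of conversion that was missing; the class name and the asString helper are illustrative assumptions, not the exact UDF from the post. Fields read without a declared type reach the UDF as DataByteArray, so each value has to be turned into a real String before it is appended to the output tuple:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical UDF showing the conversion step only.
public class ToStringFields extends EvalFunc<Tuple> {
  private static final TupleFactory mTupleFactory = TupleFactory.getInstance();

  // Coerce a raw Pig value to String; untyped fields arrive as DataByteArray.
  private static String asString(Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof DataByteArray) {
      // DataByteArray.toString() decodes the underlying bytes.
      return ((DataByteArray) value).toString();
    }
    return value.toString();
  }

  @Override
  public Tuple exec(Tuple input) throws IOException {
    Tuple out = mTupleFactory.newTuple();
    for (int i = 0; i < input.size(); i++) {
      out.append(asString(input.get(i)));  // append real String objects
    }
    return out;
  }
}

With real String objects in the output tuple, AvroStorage can map them to the Avro string type and the ClassCastException goes away.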

1 Answer:

Answer 0 (score: 0):

Usually this means that a UDF you are using does not preserve the schema, or that the schema is getting lost somewhere along the way. I believe DataByteArray is the fallback type used when the real type is unknown. You may be able to add a cast to work around this, but the better solution is to fix whatever in the UDF is dropping the schema.
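As an illustration (a hypothetical UDF, not from the question): if outputSchema() is not overridden, Pig falls back to treating the UDF's result as bytearray, and AvroStorage then receives DataByteArray where the Avro schema expects string. Declaring the schema keeps the chararray type:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    Object field = input.get(0);
    return field == null ? null : field.toString().toUpperCase();
  }

  // Without this override, Pig assumes the result is a bytearray,
  // which later surfaces as DataByteArray at store time.
  @Override
  public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema("upper", DataType.CHARARRAY));
  }
}

The FOREACH ... (chararray) cast shown in the question is the corresponding workaround at the script level.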