Question

How do I handle bad records in an Apache Pig script? In my case, I am processing a comma-delimited file that normally has 14 fields per line.

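But sometimes a line contains \n, so the record is split across two lines. My Pig script blindly accepts such records and inserts all of them into HBase.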

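The problem is that the length of the record seen in the UDF is always 3, probably because of the schema defined in the Pig script. How can I determine whether a record has the same number of fields as the schema?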

PIG

REGISTER 'files.py' using jython as myfuncs;

A = LOAD '/etl/incoming/test.txt' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

B = FOREACH A {
    GENERATE
    myfuncs.checkFormat(TOTUPLE(*)) as fields;
}

DUMP B;

UDF

import org.apache.pig.data.DataType as DataType
import org.apache.pig.impl.logicalLayer.schema.SchemaUtil as SchemaUtil

@outputSchema("record:map[]")
def checkFormat(record):
    print(type(record))
    print(record)

    record = list(record)

    print("length: %d" % len(record))  # always prints 3

    return record

Answer 1

You can write the validation as a Pig UDF in a variety of languages.

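I usually return the same schema with an additional field that indicates validity, then filter the result twice: once to write the bad records to an error log, and once to continue processing the good ones.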
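
A minimal sketch of that idea as a Jython UDF, under some assumptions that are not in the original post: each line is loaded as a single chararray (for example with TextLoader) so that Pig does not fit the row to the declared three-field schema before the UDF sees it, and the names validate_line, EXPECTED_FIELDS, field_count and is_valid are hypothetical. The Pig statements in the trailing comment are an untested illustration of how the flag could be filtered on.

```python
# Sketch only: validate_line, EXPECTED_FIELDS and the output field names are
# hypothetical and do not come from the original post.

try:
    outputSchema  # assumed to be provided by Pig when registered with jython
except NameError:
    # Fallback so the function can be exercised outside Pig.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

EXPECTED_FIELDS = 3  # name, age, gpa in the question's LOAD statement


@outputSchema("t:(line:chararray, field_count:int, is_valid:int)")
def validate_line(line):
    """Return the raw line plus its field count and a validity flag."""
    if line is None:
        return (None, 0, 0)
    count = len(line.split(","))
    if count == EXPECTED_FIELDS:
        is_valid = 1
    else:
        is_valid = 0
    return (line, count, is_valid)

# Possible Pig usage (untested sketch):
#   A = LOAD '/etl/incoming/test.txt' USING TextLoader() AS (line:chararray);
#   B = FOREACH A GENERATE FLATTEN(myfuncs.validate_line(line))
#       AS (line:chararray, field_count:int, is_valid:int);
#   good = FILTER B BY is_valid == 1;
#   bad  = FILTER B BY is_valid == 0;
```

Records in bad could then be stored in an error location, while good is split into typed fields and passed on to the HBase load.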

Apache PIG, validate input
