Pig does not interpret int correctly - custom loader

Time: 2014-02-16 16:39:13

Tags: hadoop apache-pig

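This is my first time using Pig, and I am having trouble getting it to interpret my data correctly. I did not want to define a schema for the input file by hand at run time, so I wrote a very simple custom loader. The only change I made to PigStorage was to override the getSchema method so that it reads the first two lines of my file and builds a schema from them: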

public ResourceSchema getSchema(String location,
        Job job) throws IOException {

    String[] line;
    String[] data;
    // Read the header row (field names) and the first data row (sample values);
    // try-with-resources closes the reader even if reading fails.
    try (BufferedReader br = new BufferedReader(new FileReader(location.replace("file://", "")))) {
        line = br.readLine().split(",");
        data = br.readLine().split(",");
    }

    List<FieldSchema> fields = new ArrayList<FieldSchema>();

    // Infer each field's type from the sample value, ignoring double quotes.
    for (int f = 0; f < line.length; f++)
    {
        Byte type = GetType(data[f].replace("\"", ""));
        fields.add(new FieldSchema(line[f].replace("\"", ""), type));
    }

    schema = new ResourceSchema(new Schema(fields));
    return schema;
}

private Byte GetType(Object Data)
{
    // Try int first, then double; anything else is treated as chararray.
    try{
        Integer.parseInt(Data.toString());
        return org.apache.pig.data.DataType.INTEGER;
    }
    catch(Exception e){ /* not an int, fall through */ }
    try{
        Double.parseDouble(Data.toString());
        return org.apache.pig.data.DataType.DOUBLE;
    }
    catch(Exception e){ /* not a double, fall through */ }

    return org.apache.pig.data.DataType.CHARARRAY;
}
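
The inference idea used by getSchema and GetType can be tried in isolation, without Pig on the classpath. What follows is only a minimal sketch, not the loader itself: SchemaSketch, unquote, inferType and inferSchema are hypothetical names, the type names are plain strings standing in for the DataType constants, and the header line in main is an assumption, since the post does not show the first line of the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; it does not use any Pig classes.
final class SchemaSketch {

    // Remove double quotes, mirroring replace("\"", "") in the loader.
    static String unquote(String raw) {
        return raw.replace("\"", "");
    }

    // Try int first, then double, and fall back to chararray.
    static String inferType(String raw) {
        String value = unquote(raw);
        try {
            Integer.parseInt(value);
            return "int";
        } catch (NumberFormatException ignored) {
            // not an int; try double next
        }
        try {
            Double.parseDouble(value);
            return "double";
        } catch (NumberFormatException ignored) {
            // not numeric at all
        }
        return "chararray";
    }

    // Pair each header name with the type inferred from the first data row.
    static Map<String, String> inferSchema(String headerLine, String dataLine) {
        String[] names = headerLine.split(",");
        String[] values = dataLine.split(",");
        Map<String, String> schema = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            schema.put(unquote(names[i]), inferType(values[i]));
        }
        return schema;
    }

    public static void main(String[] args) {
        // Assumed header line; the real file's first line is not shown in the post.
        String header = "\"CU_NUMBER\",\"CYCLE_DATE\",\"JOIN_NUMBER\",\"RSSD\",\"CU_TYPE\",\"CU_NAME\"";
        String row = "1,9/30/2013 0:00:00,2,\"50377\",\"1\",\"MORRIS SHEPPARD TEXARKANA\"";
        System.out.println(inferSchema(header, row));
    }
}
```

Note that splitting on every comma, as both the loader and this sketch do, would break on a quoted field that itself contains a comma.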

When I load the file and run DESCRIBE on it, the result looks exactly the way I want, for example:

{CU_NUMBER: int,CYCLE_DATE: chararray,JOIN_NUMBER: int,RSSD: int,CU_TYPE: int,CU_NAME: chararray}

The first 10 rows look like this:

(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
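
In the rows above, RSSD and CU_TYPE are still printed with their quotation marks, although DESCRIBE reports both as int. The schema inference strips quotes before parsing the sample value, but nothing in the code shown strips them from the values that are actually loaded. The sketch below only illustrates that mismatch in plain Java and is not offered as a diagnosis; QuotedIntCheck and tryParseInt are hypothetical names.

```java
// Hypothetical stand-alone check; no Pig classes are involved.
final class QuotedIntCheck {

    // Returns the parsed value, or null when the text is not a plain integer.
    static Integer tryParseInt(String text) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String raw = "\"50377\"";                 // the field as it appears in the dump
        String stripped = raw.replace("\"", "");  // what the schema inference looked at

        System.out.println(tryParseInt(raw));      // null: the quotes prevent parsing
        System.out.println(tryParseInt(stripped)); // 50377
    }
}
```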

However, when I try to do something with the data, such as the following:

FOICU = LOAD 'file:///home/biadmin/NCUA/foicu.txt' USING org.apache.pig.builtin.PigStorageInferSchema(',', '-schema');
FirstSixColumns = FOREACH FOICU GENERATE CU_NUMBER, CYCLE_DATE, JOIN_NUMBER, RSSD, CU_TYPE, CU_NAME;
TopTen = LIMIT FirstSixColumns 10;
FOICUFiltered = FILTER TopTen BY CU_NUMBER > 20;
CU_FIVE = FILTER TopTen BY CU_NUMBER == 5;
DUMP FOICUFiltered;
DUMP CU_FIVE;

FOICUFiltered returns all 10 rows, even though 7 of them have a CU_NUMBER less than 20:

(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")

CU_FIVE returns no rows at all.
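
For reference, if CU_NUMBER were compared as a true integer, the two filters would be expected to keep three rows and one row respectively. The sketch below reproduces that expectation in plain Java rather than Pig, using the ten CU_NUMBER values from the dump; ExpectedFilterResults, greaterThan and equalTo are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-alone sketch of the expected filter results; not Pig code.
final class ExpectedFilterResults {

    // CU_NUMBER values of the ten rows shown in the dump.
    static final List<Integer> CU_NUMBERS =
            Arrays.asList(1, 5, 6, 12, 13, 16, 19, 22, 26, 28);

    // Counterpart of: FILTER TopTen BY CU_NUMBER > threshold
    static List<Integer> greaterThan(int threshold) {
        return CU_NUMBERS.stream()
                .filter(n -> n > threshold)
                .collect(Collectors.toList());
    }

    // Counterpart of: FILTER TopTen BY CU_NUMBER == target
    static List<Integer> equalTo(int target) {
        return CU_NUMBERS.stream()
                .filter(n -> n == target)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(greaterThan(20)); // [22, 26, 28]
        System.out.println(equalTo(5));      // [5]
    }
}
```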

Does anyone know what I am doing wrong here, and is there a better way to load a schema dynamically at run time without using a schema file?

0 answers:

No answers yet