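This is my first time using Pig, and I'm having trouble getting it to interpret my data correctly. I don't want to define a schema for the input file at run time, so I wrote a very simple custom loader. The only change I made to PigStorage was to override the getSchema method so that it reads the first two lines of my file and builds a schema from them: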
public ResourceSchema getSchema(String location, Job job) throws IOException {
    // Read the header line and the first data line from the local input file.
    try (BufferedReader br = new BufferedReader(new FileReader(location.replace("file://", "")))) {
        String[] line = br.readLine().split(",");
        String[] data = br.readLine().split(",");
        List<FieldSchema> fields = new ArrayList<FieldSchema>();
        for (int f = 0; f < line.length; f++) {
            // Infer each column's type from the first data row, ignoring quotes.
            Byte type = GetType(data[f].replace("\"", ""));
            fields.add(new FieldSchema(line[f].replace("\"", ""), type));
        }
        schema = new ResourceSchema(new Schema(fields));
        return schema;
    }
}

// Returns the first Pig type that parses the value: int, then double, else chararray.
private Byte GetType(Object Data) {
    try {
        Integer.parseInt(Data.toString());
        return org.apache.pig.data.DataType.INTEGER;
    } catch (NumberFormatException e) {
        // Not an int; try double next.
    }
    try {
        Double.parseDouble(Data.toString());
        return org.apache.pig.data.DataType.DOUBLE;
    } catch (NumberFormatException e) {
        // Not a double either; fall back to chararray.
    }
    return org.apache.pig.data.DataType.CHARARRAY;
}
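To make the inference step easier to check in isolation, here is a minimal standalone sketch of the same idea with no Pig dependencies. The class name SchemaInferenceSketch, its method names, and the quoted header line are assumptions for illustration only (the header is my reconstruction of the column names, not a copy of the actual file); the data line is the first row of my data. It only reproduces the type guessing, not how Pig reads the rows afterwards.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaInferenceSketch {

    // Removes double quotes and surrounding whitespace from a raw CSV token.
    static String unquote(String raw) {
        return raw.replace("\"", "").trim();
    }

    // Guesses a Pig-style type name for one token: int, then double, else chararray.
    static String inferType(String raw) {
        String value = unquote(raw);
        try {
            Integer.parseInt(value);
            return "int";
        } catch (NumberFormatException e) {
            // Not an int; try double next.
        }
        try {
            Double.parseDouble(value);
            return "double";
        } catch (NumberFormatException e) {
            // Not a double either.
        }
        return "chararray";
    }

    // Pairs each header name with the type inferred from the first data row.
    static List<String> inferSchema(String headerLine, String dataLine) {
        String[] names = headerLine.split(",");
        String[] values = dataLine.split(",");
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            // A short data row falls back to chararray for the missing columns.
            String type = i < values.length ? inferType(values[i]) : "chararray";
            fields.add(unquote(names[i]) + ": " + type);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical header line; the real file's header may differ.
        String header = "\"CU_NUMBER\",\"CYCLE_DATE\",\"JOIN_NUMBER\",\"RSSD\",\"CU_TYPE\",\"CU_NAME\"";
        String data = "1,9/30/2013 0:00:00,2,\"50377\",\"1\",\"MORRIS SHEPPARD TEXARKANA\"";
        System.out.println(inferSchema(header, data));
    }
}
```

Like my loader, this splits on every comma, so a quoted value that itself contains a comma would be split into two columns.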
When I load the file and run DESCRIBE on it, the output looks exactly the way I want, for example:
{CU_NUMBER: int,CYCLE_DATE: chararray,JOIN_NUMBER: int,RSSD: int,CU_TYPE: int,CU_NAME: chararray}
The first 10 rows look like this:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
However, when I try to do anything with the data, as in the following script:
FOICU = LOAD 'file:///home/biadmin/NCUA/foicu.txt' USING org.apache.pig.builtin.PigStorageInferSchema(',', '-schema');
FirstSixColumns = FOREACH FOICU GENERATE CU_NUMBER, CYCLE_DATE, JOIN_NUMBER, RSSD, CU_TYPE, CU_NAME;
TopTen = LIMIT FirstSixColumns 10;
FOICUFiltered = FILTER TopTen BY CU_NUMBER > 20;
CU_FIVE = FILTER TopTen BY CU_NUMBER == 5;
DUMP FOICUFiltered;
DUMP CU_FIVE;
FOICUFiltered returns all 10 rows, even though 7 of them have a CU_NUMBER less than 20:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
CU_FIVE returns no rows at all.
Does anyone know what I'm doing wrong here? And is there a better way to load a schema dynamically at run time without using a schema file?