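(First post!)
I've been playing with a sample resume dataset. The resume object is somewhat complex, with several child objects. For the current stage of my project, I'm trying to flatten the dataset by storing the child objects as JSON strings. I've run into a schema problem with the ToJson UDF. (https://github.com/rjurney/pig-to-json)
If I run the following statement in my Pig script, I get the correct data in my fields, but every ToJson() call reuses the field names from Positions: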
stringifiedJSON =
FOREACH fullJSON
GENERATE
id .. TotalYears,
com.hortonworks.pig.udf.ToJson(Awards) AS Awards:chararray,
com.hortonworks.pig.udf.ToJson(Certifications) AS Certifications:chararray,
CASE WHEN Degrees IS NULL THEN '[]' ELSE com.hortonworks.pig.udf.ToJson(Degrees) END AS Degrees:chararray,
com.hortonworks.pig.udf.ToJson(Links) AS Links:chararray,
com.hortonworks.pig.udf.ToJson(Groups) AS Groups:chararray,
com.hortonworks.pig.udf.ToJson(MilitaryService) AS MilitaryService:chararray,
com.hortonworks.pig.udf.ToJson(Positions) AS Positions:chararray;
If I DESCRIBE the "fullJSON" dataset, this is what I get back (".." and "…" stand for other fields that are not relevant to the discussion):
fullJSON:
{
id: chararray,
..
Awards: {award: (AwardDate: chararray,AwardDescription: chararray,AwardTitle: chararray)},
Certifications: {certification: (CertDescription: chararray,CertEndDate: chararray,CertStartDate: chararray,CertTitle: chararray)},
…
Degrees: {(DegreeTitle: chararray,DegreeEndDate: chararray,DegreeStartDate: chararray,School: chararray,SchoolCity: chararray,SchoolState: chararray,DegreeEducationLevel: chararray)},
…
Links: {link: (LinkTitle: chararray,LinkURL: chararray)},
Groups: {group: (GroupDescription: chararray,GroupEndDate: chararray,GroupStartDate: chararray,GroupTitle: chararray)},
…
MilitaryService: {military_service: (MilitaryBranch: chararray,MilitaryCommendations: chararray,MilitaryCountry: chararray,MilitaryDescripton: chararray,MilitaryStartDate: chararray,MilitaryEndDate: chararray,MilitaryRank: chararray)},
…
Positions: {(Company: chararray,CompanyCity: chararray,CompanyState: chararray,JobStartDate: chararray,JobEndDate: chararray,JobTitle: chararray,IsCurrentTitle: int)},
…
}
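Does anyone have any ideas? I tried splitting each ToJson() call into its own step, but I got the same result.
Later I dug into the source of ToJson.java, and I think I've narrowed it down to the bit of code below. I added log output of strSchema immediately after it, and it always returns the same information (the Positions schema).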
if (myProperties == null) {
// Retrieve our class specific properties from UDFContext
myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass());
}
String strSchema = myProperties.getProperty("horton.json.udf.schema");
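To illustrate why the lookup above might always return the Positions schema, here is a minimal sketch in plain Java. It assumes that the properties are keyed only by the UDF class, so every ToJson() call site shares one Properties object and the last schema written wins. ClassKeyedContext and FakeToJson are hypothetical stand-ins written for this illustration; they are not Pig classes and only model the schema bookkeeping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a context that keeps one Properties object per UDF class.
class ClassKeyedContext {
    private final Map<String, Properties> store = new HashMap<>();

    Properties getUDFProperties(Class<?> udfClass) {
        // The key is only the class name, so all instances of a class share the entry.
        return store.computeIfAbsent(udfClass.getName(), k -> new Properties());
    }
}

// Hypothetical stand-in for the UDF; it only records and reads back a schema string.
class FakeToJson {
    static final String SCHEMA_KEY = "horton.json.udf.schema";
    private final ClassKeyedContext context;

    FakeToJson(ClassKeyedContext context) {
        this.context = context;
    }

    // Models the planning step: each call site stores its input schema.
    void outputSchema(String inputSchema) {
        context.getUDFProperties(FakeToJson.class).setProperty(SCHEMA_KEY, inputSchema);
    }

    // Models the execution step: each call site reads the stored schema back.
    String schemaSeenByExec() {
        return context.getUDFProperties(FakeToJson.class).getProperty(SCHEMA_KEY);
    }
}

class SchemaCollisionDemo {
    public static void main(String[] args) {
        ClassKeyedContext ctx = new ClassKeyedContext();
        FakeToJson degrees = new FakeToJson(ctx);
        FakeToJson positions = new FakeToJson(ctx);

        degrees.outputSchema("Degrees: {(DegreeTitle: chararray)}");
        positions.outputSchema("Positions: {(Company: chararray)}");

        // Both lines print the Positions schema, because the second write
        // replaced the first in the shared Properties object.
        System.out.println(degrees.schemaSeenByExec());
        System.out.println(positions.schemaSeenByExec());
    }
}
```

In this model the Degrees call site ends up reading the Positions schema, which would match Degrees values being emitted under Positions field names.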
Here is a sample of the stringifiedJSON output:
{
"id":"http://something.com/some_guy",
...
"Awards":"[]",
"Certifications":"[]",
"Degrees":"[{\"CompanyState\":null,\"CompanyCity\":null,\"JobEndDate\":\"\",\"IsCurrentTitle\":\"Bachelor's Degree\",\"JobTitle\":\"\",\"Company\":\"BS in Marketing\",\"JobStartDate\":\"State University\"}]",
"Links":"[]",
"Groups":"[]",
"MilitaryService":"[]",
"Positions":"[{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job runner\",\"Company\":\"somecompany\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Scottsdale\",\"JobEndDate\":\"2010-03-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Sales Rep\",\"Company\":\"Company2\",\"JobStartDate\":\"2005-06-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"Job 3\",\"Company\":\"Company3\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":\"2004-12-01T00:00:00.000Z\",\"IsCurrentTitle\":0,\"JobTitle\":\"CompanyRep\",\"Company\":\"Company4\",\"JobStartDate\":\"1991-05-01T00:00:00.000Z\"},{\"CompanyState\":\"AZ\",\"CompanyCity\":\"Phoenix\",\"JobEndDate\":null,\"IsCurrentTitle\":null,\"JobTitle\":\"Job5\",\"Company\":\"Company5\",\"JobStartDate\":\"2014-09-01T00:00:00.000Z\"}]"
}
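Answer 0 (score: 0)
Here is what I ended up doing. I would much rather not have done it this way, but it works. I'd prefer not to make seven separate DEFINE calls up front, and instead just call the function itself and have it work.
I added a String field called signature and a constructor to the class: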
String signature = null;
public ToJson(String signature) {
    this.signature = signature;
}
I modified the class's outputSchema(), adding the signature to the getUDFProperties call:
Properties udfProp = context.getUDFProperties(this.getClass(),new String[]{signature});
I modified exec() in the same way:
myProperties = UDFContext.getUDFContext().getUDFProperties(this.getClass(),new String[]{signature});
Then, in the Pig script, I added several DEFINE clauses:
DEFINE awardToJson com.hortonworks.pig.udf.ToJson('award');
DEFINE certToJson com.hortonworks.pig.udf.ToJson('cert');
DEFINE degreeToJson com.hortonworks.pig.udf.ToJson('degree');
DEFINE linkToJson com.hortonworks.pig.udf.ToJson('link');
DEFINE groupToJson com.hortonworks.pig.udf.ToJson('group');
DEFINE militaryToJson com.hortonworks.pig.udf.ToJson('military');
DEFINE positionToJson com.hortonworks.pig.udf.ToJson('position');
Then I adjusted the function calls in the Pig script:
stringifiedJSON =
FOREACH fullJSON
GENERATE
id .. TotalYears,
awardToJson(Awards) AS Awards:chararray,
certToJson(Certifications) AS Certifications:chararray,
CASE WHEN Degrees IS NULL THEN '[]' ELSE degreeToJson(Degrees) END AS Degrees:chararray,
linkToJson(Links) AS Links:chararray,
groupToJson(Groups) AS Groups:chararray,
militaryToJson(MilitaryService) AS MilitaryService:chararray,
positionToJson(Positions) AS Positions:chararray
;
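
The effect of passing a signature can be sketched in plain Java, assuming the property store is keyed by the class plus the extra string arguments. SignatureKeyedContext and SignedToJson are hypothetical stand-ins written for this illustration; they are not Pig classes and only model the schema bookkeeping.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for a context keyed by UDF class plus extra string arguments.
class SignatureKeyedContext {
    private final Map<String, Properties> store = new HashMap<>();

    Properties getUDFProperties(Class<?> udfClass, String[] args) {
        // Different argument arrays produce different keys, hence separate Properties.
        String key = udfClass.getName() + Arrays.toString(args);
        return store.computeIfAbsent(key, k -> new Properties());
    }
}

// Hypothetical stand-in for the modified UDF; it carries the signature from its constructor.
class SignedToJson {
    static final String SCHEMA_KEY = "horton.json.udf.schema";
    private final SignatureKeyedContext context;
    private final String signature;

    SignedToJson(SignatureKeyedContext context, String signature) {
        this.context = context;
        this.signature = signature;
    }

    private Properties props() {
        return context.getUDFProperties(SignedToJson.class, new String[]{signature});
    }

    // Models the planning step: store this call site's schema under its own key.
    void outputSchema(String inputSchema) {
        props().setProperty(SCHEMA_KEY, inputSchema);
    }

    // Models the execution step: read back the schema stored under the same key.
    String schemaSeenByExec() {
        return props().getProperty(SCHEMA_KEY);
    }
}

class SignatureDemo {
    public static void main(String[] args) {
        SignatureKeyedContext ctx = new SignatureKeyedContext();
        SignedToJson degrees = new SignedToJson(ctx, "degree");
        SignedToJson positions = new SignedToJson(ctx, "position");

        degrees.outputSchema("Degrees: {(DegreeTitle: chararray)}");
        positions.outputSchema("Positions: {(Company: chararray)}");

        // Each instance now reads back the schema it stored itself.
        System.out.println(degrees.schemaSeenByExec());
        System.out.println(positions.schemaSeenByExec());
    }
}
```

In this model, two instances built with the same signature string would still share one entry, which is why each DEFINE above uses a distinct value.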