I have a file in HDFS:
44,UK,{"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}},31-07-2016
91,INDIA,{"names":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
and I want to store it into a CSV file using a Pig loader:
44,UK,names,name1,John,31-07-2016
44,UK,names,name2,marry,31-07-2016
..
44,UK,fruits,fruit1,apple,31-07-2016
..
91,INDIA,names,name1,Ram,31-07-2016
..
91,INDIA,null,null,Ram,31-07-2016
What should the Pig script be?
Answer:
Since your records are not proper JSON strings, no JSON loader/storer will help you here. Writing a UDF is the simpler approach.
Updated approach:
If you convert the input to a tab-separated file, the UDF and Pig script below will work.
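For illustration, the first record would then look like this, where <TAB> stands for a literal tab character (this layout is an assumption, based on PigStorage's default tab delimiter):
44<TAB>UK<TAB>{"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}}<TAB>31-07-2016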
UDF:
package com.test.udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Input format:
 * {"names":{"name1":"John","name2":"marry","name3":"stuart"},"fruits":{"fruit1":"apple","fruit2":"orange"}}
 *
 * Emits a bag with one single-field tuple per inner key/value pair,
 * e.g. ("names,name1,John").
 */
public class jsonToTuples extends EvalFunc<DataBag> {

    private final ObjectMapper objectMapper = new ObjectMapper();
    private final TypeReference<HashMap<String, Object>> typeRef =
            new TypeReference<HashMap<String, Object>>() {};

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String jsonRecord = (String) input.get(0);
        if (StringUtils.isNotBlank(jsonRecord)) {
            try {
                List<String> recordList = new ArrayList<String>();
                Map<String, Object> jsonDataMap = objectMapper.readValue(jsonRecord, typeRef);
                // Flatten the "names" map into "names,<key>,<value>" strings.
                if (jsonDataMap.get("names") != null) {
                    Map<String, String> namesDataMap = (Map<String, String>) jsonDataMap.get("names");
                    for (String key : namesDataMap.keySet()) {
                        recordList.add("names" + "," + key + "," + namesDataMap.get(key));
                    }
                }
                // Flatten the "fruits" map the same way; an empty map contributes nothing.
                if (jsonDataMap.get("fruits") != null) {
                    Map<String, String> fruitsDataMap = (Map<String, String>) jsonDataMap.get("fruits");
                    for (String key : fruitsDataMap.keySet()) {
                        recordList.add("fruits" + "," + key + "," + fruitsDataMap.get(key));
                    }
                }
                // Wrap each flattened string in a single-field tuple and collect them in a bag.
                DataBag outputBag = BagFactory.getInstance().newDefaultBag();
                for (int i = 0; i < recordList.size(); i++) {
                    Tuple outputTuple = TupleFactory.getInstance().newTuple(1);
                    outputTuple.set(0, recordList.get(i));
                    outputBag.add(outputTuple);
                }
                return outputBag;
            } catch (Exception e) {
                System.err.println("caught exception while parsing record: " + jsonRecord);
                e.printStackTrace();
                return null;
            }
        }
        return null;
    }
}
Pig script:
register 'testUDF.jar';
A = load 'data.txt' using PigStorage() as (id:chararray, country:chararray, record:chararray, date:chararray);
B = foreach A generate id, country, FLATTEN(com.test.udf.jsonToTuples(record)), date;
dump B;
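Since each string the UDF emits is already comma separated, storing the result with a comma delimiter yields the CSV lines shown in the question. A minimal sketch, assuming an output directory named 'output_csv' (the path is a placeholder):
store B into 'output_csv' using PigStorage(',');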
Old approach:
Below is how I would read your record in the UDF if it is comma separated.
As I said in the comments, try using the magic of split inside the UDF to separate your fields. I have not tested this, but here is what I would try in the UDF:
(Note that I am not sure this is the best option; you may want to improve it further.)
// Split the record into at most 3 parts: id, country, and the rest ("json,date").
String[] strSplit = ((String) input.get(0)).split(",", 3);
String id = strSplit[0];
String country = strSplit[1];
String jsonWithDate = strSplit[2];
// The date is the last comma-separated token of the full record.
String[] datePart = ((String) input.get(0)).split(",");
String date = datePart[datePart.length - 1];
/**
* above jsonWithDate should look like -
* {"names":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
*
*/
// Strip the date (and the trailing comma) to leave only the JSON part.
// replaceAll is needed here because ",$" is a regex; plain replace() would not match it.
String jsonString = jsonWithDate.replace(date, "").replaceAll(",$", "");
/**
* now use some parser or object mapper to convert jsonString to desired list of values.
*/
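For completeness, one way to finish that last step is to reuse the same Jackson mapper as in the updated approach (a sketch; jsonString comes from the snippet above):
// Parse jsonString into a map, exactly as the updated UDF does.
Map<String, Object> jsonDataMap = new ObjectMapper().readValue(
        jsonString, new TypeReference<HashMap<String, Object>>() {});
// jsonDataMap.get("names") and jsonDataMap.get("fruits") can then be
// flattened into "group,key,value" strings as in jsonToTuples.exec above.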