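I am new to HBase and Phoenix. I have been trying to use Apache Phoenix with MapReduce to insert data into an HBase table that has multiple column families.
Here is my HBase table, created through Phoenix: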
CREATE TABLE defect (planning_folder_id varchar(12) NOT NULL, artifact_id VARCHAR(12) NOT NULL, data.category VARCHAR, data.root_cause VARCHAR, association.artifact_id VARCHAR(12) CONSTRAINT PK PRIMARY KEY (planning_folder_id, artifact_id));
Based on the Phoenix CREATE TABLE statement above, the table looks like this:
------------------------------------------------------------------------------------------------
| planning_folder_id | artifact_id | data:category | data:root_cause | association:artifact_id |
------------------------------------------------------------------------------------------------
| plan1234           | artf1234    | cat_a         | cause_a         | artf2345                |
|                    |             |               |                 | artf5678                |
|                    |             |               |                 | artf8987                |
------------------------------------------------------------------------------------------------
| plan6765           | artf5454    | cat_b         | cause_a         | artf2222                |
|                    |             |               |                 | artf7643                |
|                    |             |               |                 | artf2345                |
------------------------------------------------------------------------------------------------
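As you can see, each artifact can have many related artifacts, which are identified by the column family association and the qualifier artifact_id.
Back to my question: I want to write a MapReduce job that reads the data and populates the table described above.
This is what I have so far:
Mapper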
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.xml.sax.SAXException;

public class PhoenixMapper<K> extends Mapper<LongWritable, Text, K, DefectWritable> {
    private Parser parser = new MyParser();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String document = value.toString();
        try {
            Artifact artf = parser.parse(document);
            DefectWritable defect = new DefectWritable();
            defect.setPlanningFolderId(artf.getPlanningFolderId());
            defect.setArtifactId(artf.getId());
            defect.setRootCause(artf.getRootCause());
            defect.setCategory(artf.getCategory());
            // How to insert this into the hbase table
            defect.setAssociations(artf.getAssociations());
            context.write(null, defect);
        } catch (ParserConfigurationException | SAXException e) {
            e.printStackTrace();
        }
    }
}
DefectWritable (written to the table)
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DefectWritable implements DBWritable {
    private String planningFolderId;
    private String artifactId;
    private String rootCause;
    private String category;
    private String[] associations;

    // getters/setters omitted

    @Override
    public void write(PreparedStatement pstmt) throws SQLException {
        pstmt.setString(1, planningFolderId);
        pstmt.setString(2, artifactId);
        pstmt.setString(3, category);
        pstmt.setString(4, rootCause);
        // what to do with "associations"?
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        planningFolderId = rs.getString("PLANNING_FOLDER_ID");
        artifactId = rs.getString("ARTIFACT_ID");
        category = rs.getString("CATEGORY");
        rootCause = rs.getString("ROOT_CAUSE");
        // what to do with association:artifact_id Array associationArray = rs.getArray("ASSOCIATION"); ?
    }
}
DataImporter
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.phoenix.mapreduce.PhoenixOutputFormat;
import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil;

public class PhoenixDataImporter extends Configured implements Tool {
    private static final String DOCUMENT_START_TAG = "<artifact>";
    private static final String DOCUMENT_END_TAG = "</artifact>";
    private static final String TABLE_DEFECT = "DEFECT";

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("xmlinput.start", DOCUMENT_START_TAG);
        conf.set("xmlinput.end", DOCUMENT_END_TAG);

        Job job = Job.getInstance(conf, getClass().getSimpleName());
        job.setJarByClass(getClass());
        job.setInputFormatClass(XMLInputFormat.class);
        job.setOutputFormatClass(PhoenixOutputFormat.class);
        job.setMapOutputValueClass(DefectWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(PhoenixMapper.class);
        job.setNumReduceTasks(0);

        PhoenixMapReduceUtil.setOutput(job, TABLE_DEFECT, "PLANNING_FOLDER_ID,ARTIFACT_ID,CATEGORY,ROOT_CAUSE");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(HBaseConfiguration.create(), new PhoenixDataImporter(), args);
        System.exit(exitCode);
    }
}
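The code currently inserts data into the PLANNING_FOLDER_ID, ARTIFACT_ID, CATEGORY, and ROOT_CAUSE columns. I don't know how to insert into ASSOCIATION:ARTIFACT_ID, because an artifact may have several related artifacts. My current solution is based on https://phoenix.apache.org/phoenix_mr.html.
Can anyone help me with this? Since I am new to HBase, maybe you could also comment on my current table design? I have thought about splitting it into separate tables (another table for association:artifact_id) and joining them at query time. However, I would take a performance hit from the join.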
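To make the separate-table idea concrete, here is a minimal sketch of how one defect could be flattened into one row per related artifact. It assumes a hypothetical second table (call it DEFECT_ASSOCIATION) whose primary key would be (planning_folder_id, artifact_id, associated_artifact_id). The table name, the class AssociationRows, and the method flatten are illustrative only and are not part of Phoenix or Hadoop; the code is plain Java and makes no Phoenix calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: flattens one defect into rows for a hypothetical association table. */
public class AssociationRows {

    /** One row of the hypothetical DEFECT_ASSOCIATION table. */
    public static final class Row {
        public final String planningFolderId;
        public final String artifactId;
        public final String associatedArtifactId;

        Row(String planningFolderId, String artifactId, String associatedArtifactId) {
            this.planningFolderId = planningFolderId;
            this.artifactId = artifactId;
            this.associatedArtifactId = associatedArtifactId;
        }
    }

    /** Emits one row per related artifact; a defect without associations yields no rows. */
    public static List<Row> flatten(String planningFolderId, String artifactId, String[] associations) {
        List<Row> rows = new ArrayList<>();
        if (associations == null) {
            return rows;
        }
        for (String associated : associations) {
            rows.add(new Row(planningFolderId, artifactId, associated));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample values taken from the first row of the table above.
        String[] related = {"artf2345", "artf5678", "artf8987"};
        for (Row row : flatten("plan1234", "artf1234", related)) {
            System.out.println(row.planningFolderId + " | " + row.artifactId + " | " + row.associatedArtifactId);
        }
    }
}
```

Each Row could presumably be written through its own DBWritable, in the same way as DefectWritable, which is where the extra join at query time would come from.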
If you would like me to clarify anything that is unclear, please comment below :)
Thanks in advance,
Peeranat