Inserting data into multiple HBase column families using Apache Phoenix with MapReduce

Time: 2015-07-11 21:28:44

Tags: hadoop mapreduce hbase phoenix

I am new to HBase and Phoenix. I have been trying to use Apache Phoenix with MapReduce to insert data into an HBase table that has multiple column families.

Here is my HBase table, created through Phoenix:

CREATE TABLE defect (planning_folder_id varchar(12) NOT NULL, artifact_id VARCHAR(12) NOT NULL, data.category VARCHAR, data.root_cause VARCHAR, association.artifact_id VARCHAR(12) CONSTRAINT PK PRIMARY KEY (planning_folder_id, artifact_id));

Given the Phoenix CREATE TABLE statement above, the table looks like this:

------------------------------------------------------------------------------------------------
| planning_folder_id | artifact_id | data:category | data:root_cause | association:artifact_id |
------------------------------------------------------------------------------------------------
|      plan1234      |   artf1234  |     cat_a     |     cause_a     |        artf2345         |
|                    |             |               |                 |        artf5678         |
|                    |             |               |                 |        artf8987         |
------------------------------------------------------------------------------------------------
|      plan6765      |   artf5454  |     cat_b     |     cause_a     |        artf2222         |
|                    |             |               |                 |        artf7643         |
|                    |             |               |                 |        artf2345         |
------------------------------------------------------------------------------------------------

As you can see, each artifact has many related artifacts, identified by the column family `association` and the qualifier `artifact_id`.
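For intuition, Phoenix maps each non-primary-key column onto an HBase cell addressed by `family:qualifier`. Below is a rough, illustrative sketch of the cells behind the first example row; the exact row-key separators and qualifier encoding vary across Phoenix versions, so treat the details as an approximation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch of how Phoenix might lay the first example row out as HBase
// cells. The row key concatenates the PK columns; exact separators and
// qualifier encoding depend on the Phoenix version, so this is illustrative.
public class CellLayoutSketch {

    static Map<String, String> cells(String category, String rootCause, String assocArtifactId) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("data:CATEGORY", category);
        m.put("data:ROOT_CAUSE", rootCause);
        m.put("association:ARTIFACT_ID", assocArtifactId);
        return m;
    }

    public static void main(String[] args) {
        // Variable-length VARCHAR PK parts are \x00-separated in the row key.
        String rowKey = "plan1234" + "\u0000" + "artf1234";
        System.out.println("row key: " + rowKey.replace("\u0000", "\\x00"));
        cells("cat_a", "cause_a", "artf2345")
                .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Note that one `family:qualifier` coordinate holds a single value per row (ignoring cell versions), which is part of why several association ids do not fit naturally into one Phoenix column.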

Back to my question: I want to write a MapReduce job that reads the data and populates the table described above.

Here is what I have.

Mapper

public class PhoenixMapper<K> extends Mapper<LongWritable, Text, K, DefectWritable> {

    private Parser parser = new MyParser();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String document = value.toString();
        try {
            Artifact artf = parser.parse(document);
            DefectWritable defect = new DefectWritable();
            defect.setPlanningFolderId(artf.getPlanningFolderId());
            defect.setArtifactId(artf.getId());
            defect.setRootCause(artf.getRootCause());
            defect.setCategory(artf.getCategory());


            // How to insert this into the hbase table
            defect.setAssociations(artf.getAssociations());

            context.write(null, defect);
        } catch (ParserConfigurationException | SAXException e) {
            e.printStackTrace();
        }
    }
}

DefectWritable (written to the table)

public class DefectWritable implements DBWritable {

    private String planningFolderId;
    private String artifactId;
    private String rootCause;
    private String category;
    private String[] associations;

    // getters/setters ignored

    @Override
    public void write(PreparedStatement pstmt) throws SQLException {
        pstmt.setString(1, planningFolderId);
        pstmt.setString(2, artifactId);
        pstmt.setString(3, category);
        pstmt.setString(4, rootCause);
        // what to do with "associations"?
    }

    @Override
    public void readFields(ResultSet rs) throws SQLException {
        planningFolderId = rs.getString("PLANNING_FOLDER_ID");
        artifactId = rs.getString("ARTIFACT_ID");
        category = rs.getString("CATEGORY");
        rootCause = rs.getString("ROOT_CAUSE");

    // what to do with association:artifact_id Array associationArray = rs.getArray("ASSOCIATION"); ?

    }
}
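One option, if changing the schema is acceptable, might be a Phoenix ARRAY column (e.g. `association.artifact_ids VARCHAR ARRAY`) so that all related artifacts fit in a single cell. The column name `artifact_ids` is hypothetical. A minimal sketch of assembling the corresponding UPSERT text:

```java
// Sketch only: assumes the table were redefined with a Phoenix ARRAY column,
// e.g. association.artifact_ids VARCHAR ARRAY. Column names are hypothetical.
public class ArrayUpsertSketch {

    // Render a Phoenix ARRAY[...] literal from the association ids.
    static String arrayLiteral(String[] values) {
        StringBuilder sb = new StringBuilder("ARRAY[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('\'').append(values[i]).append('\'');
        }
        return sb.append(']').toString();
    }

    static String upsert(String planId, String artifactId, String[] associations) {
        return "UPSERT INTO defect (planning_folder_id, artifact_id, association.artifact_ids)"
                + " VALUES ('" + planId + "','" + artifactId + "',"
                + arrayLiteral(associations) + ")";
    }

    public static void main(String[] args) {
        System.out.println(upsert("plan1234", "artf1234",
                new String[] {"artf2345", "artf5678", "artf8987"}));
    }
}
```

From JDBC code such as `DefectWritable.write`, the equivalent would presumably be `pstmt.setArray(5, connection.createArrayOf("VARCHAR", associations))` against a parameterized UPSERT, rather than building SQL text by hand.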

DataImporter

public class PhoenixDataImporter extends Configured implements Tool {
    private static final String DOCUMENT_START_TAG = "<artifact>";
    private static final String DOCUMENT_END_TAG = "</artifact>";
    private static final String TABLE_DEFECT = "DEFECT"; 

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf();
        conf.set("xmlinput.start", DOCUMENT_START_TAG);
        conf.set("xmlinput.end", DOCUMENT_END_TAG);

        Job job = Job.getInstance(conf, getClass().getSimpleName());
        job.setJarByClass(getClass());
        job.setInputFormatClass(XMLInputFormat.class);
        job.setOutputFormatClass(PhoenixOutputFormat.class);
        job.setMapOutputValueClass(DefectWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(PhoenixMapper.class);
        job.setNumReduceTasks(0);
        PhoenixMapReduceUtil.setOutput(job, TABLE_DEFECT, "PLANNING_FOLDER_ID,ARTIFACT_ID,CATEGORY,ROOT_CAUSE");
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main( String[] args ) throws Exception {
        int exitCode = ToolRunner.run(HBaseConfiguration.create(), new PhoenixDataImporter(), args);
        System.exit(exitCode);
    }
}

The code currently inserts data into the PLANNING_FOLDER_ID, ARTIFACT_ID, CATEGORY, and ROOT_CAUSE columns. I don't know how to insert ASSOCIATION:ARTIFACT_ID, since some artifacts may have more than one related artifact. My current solution is based on https://phoenix.apache.org/phoenix_mr.html

Can anyone help me? Also, since I am new to HBase, could you comment on my current table design? I have thought about splitting off a separate table (holding association:artifact_id) and joining the two at query time, but that would cost performance.
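For the separate-table idea mentioned above, a rough sketch of how the associations would be flattened: each (artifact, related artifact) pair becomes its own row, so N associations produce N UPSERTs. The table name `defect_association` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "separate association table" design: each (artifact, related
// artifact) pair becomes its own row in a link table whose PK covers all
// three columns. The table name defect_association is hypothetical.
public class AssociationFlattener {

    static List<String> toUpserts(String planId, String artifactId, String[] associations) {
        List<String> stmts = new ArrayList<>();
        for (String assoc : associations) {
            stmts.add("UPSERT INTO defect_association VALUES ('"
                    + planId + "','" + artifactId + "','" + assoc + "')");
        }
        return stmts;
    }

    public static void main(String[] args) {
        for (String s : toUpserts("plan1234", "artf1234",
                new String[] {"artf2345", "artf5678", "artf8987"})) {
            System.out.println(s);
        }
    }
}
```

In the MapReduce job this would mean emitting one writable per association (in addition to the main defect record), rather than one writable per artifact.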

If there is anything unclear that you would like me to clarify, please comment below :)

Thanks in advance,

Peeranat

0 Answers:

There are no answers yet.