Writing hierarchical data with nested lists and structs to Apache ORC files

Date: 2019-07-12 09:47:58

Tags: java hadoop hive hierarchical-data orc

What is the correct way to write hierarchical data with optional and repeated fields to an ORC file?

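The data is a list of JSON objects. Each object has an ID and a hierarchical path of DOM elements. Each DOM element has an optional tag and an optional list of classes: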

{id: 1, hierarchy: [
    {tag: "div", classes: ["blue", "red"]},
    {tag: "p"}
]}

{id: 1, hierarchy: [
    {tag: "nav", classes: ["green"]}, 
    {classes: ["red"]}
]}

The schema in ORC (Java) looks like this:

TypeDescription rootSchema = createStruct();
rootSchema.addField("id", createInt());
TypeDescription domElementSchema =
    createStruct()
        .addField("tag", createString())
        .addField("classes", createList(createString()));
// First list because 1 hierarchy per JSON object. 
// Second list because each hierarchy has multiple DOM elements.
rootSchema.addField("hierarchy", createList(createList(domElementSchema)));
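
For reference, a list column in a vectorized batch is described by one offset and one length per row, both pointing into a single shared child vector. The sketch below shows that convention in plain Java with no ORC dependency; ListOffsetsSketch and layout are hypothetical names used only for illustration, and the running counter plays the role that the childCount field plays in ListColumnVector.

```java
import java.util.Arrays;
import java.util.List;

public class ListOffsetsSketch {
    /**
     * Computes {offsets, lengths} for rows whose elements are stored
     * back to back in one child vector.
     */
    public static long[][] layout(List<? extends List<?>> rows) {
        long[] offsets = new long[rows.size()];
        long[] lengths = new long[rows.size()];
        long childCount = 0; // number of child slots used so far
        for (int row = 0; row < rows.size(); row++) {
            offsets[row] = childCount;           // where this row's elements start
            lengths[row] = rows.get(row).size(); // how many elements this row has
            childCount += lengths[row];
        }
        return new long[][] {offsets, lengths};
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
            List.of("blue", "red"),
            List.<String>of(),
            List.of("green"));
        long[][] result = layout(rows);
        System.out.println(Arrays.toString(result[0])); // prints [0, 2, 2]
        System.out.println(Arrays.toString(result[1])); // prints [2, 0, 1]
    }
}
```

Note that an empty row still gets an offset; only its length is zero.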

I am confused about how to set the offsets and lengths of a ListColumnVector for the nested lists. My code so far is as follows:

private static void writeJsonToOrcFile(TypeDescription schema) {
  VectorizedRowBatch batch = schema.createRowBatchV2();

  var ids = (LongColumnVector) batch.cols[0];
  var hierarchies = (ListColumnVector) batch.cols[1];
  var domElems = (ListColumnVector) hierarchies.child;
  var domElemStructs = (StructColumnVector) domElems.child;
  var tags = (BytesColumnVector) domElemStructs.fields[0];
  // The struct has two fields: "tag" at index 0 and "classes" at index 1.
  var classesList = (ListColumnVector) domElemStructs.fields[1];
  var classes = (BytesColumnVector) classesList.child;

  int hierarchiesOffset = 0;
  int classesOffset = 0;

  for (Map<String, JsonNode> jsonNode : JsonSource.forPath("my-data.json")) {
    // The current row.
    int row = batch.size++;

    // Write object ID
    ids.vector[row] = jsonNode.get("id").asLong();

    JsonNode hierarchy = jsonNode.get("hierarchy");
    hierarchies.offsets[row] = hierarchiesOffset;
    hierarchies.lengths[row] = hierarchy.size();

    // Associate offsets for DomElements
    for (int i = 0; i < hierarchy.size(); i++) {
      // Each item of the DomElements maps to a single struct.
      domElems.offsets[row + i] = row + i;
      domElems.lengths[row + i] = 1;
    }
    if (hierarchy.size() == 0) {
      domElems.offsets[row] = row;
      domElems.lengths[row] = 0;
      domElems.noNulls = false;
      domElems.isNull[row] = true;
    }

    // Write tags
    for (JsonNode domElement : hierarchy) {
      JsonNode tag = domElement.get("tag");
      if (tag == null) {
        tags.isNull[row] = true;
        tags.noNulls = false;
      } else {
        tags.setVal(row, tag.asText().getBytes());
      }
    }

    // Write classes
    int domElemIndex = 0;
    for (JsonNode domElement : hierarchy) {
      JsonNode classesNode = domElement.get("classes");
      classesList.offsets[row + domElemIndex] = classesOffset;
      classesList.lengths[row + domElemIndex] = classesNode.size();

      for (JsonNode classNode : classesNode) {
        String cssClass = classNode.asText();
        classes.setVal(row, cssClass.getBytes());
        classesOffset += 1;
      }

      hierarchiesOffset += 1;
      domElemIndex += 1;
    }
  }
}

Is this approach correct?

  • hierarchies uses offsets into domElems based on the length of the hierarchy array.
  • domElems maps one-to-one to domElemStructs.
  • classesList uses the row plus domElemIndex as the offset into the classes vector.
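
One way to check the bookkeeping above is to flatten the two example objects by hand. The sketch below is plain Java rather than the ORC API, and it rests on an assumption that differs from the schema in the question: each row holds a single list of DOM element structs, not a list of lists. NestedListLayoutSketch, DomElement, Columns and flatten are hypothetical names. The point it illustrates is that the tag and classes-list entries are indexed by the element's position in the flattened child vector, not by the row number, and that each offset is simply the number of child entries written so far.

```java
import java.util.ArrayList;
import java.util.List;

public class NestedListLayoutSketch {
    /** Hypothetical stand-in for one DOM element; tag may be null. */
    public static final class DomElement {
        public final String tag;
        public final List<String> classes;
        public DomElement(String tag, List<String> classes) {
            this.tag = tag;
            this.classes = classes;
        }
    }

    /** Flattened columns, named after the vectors in the question. */
    public static final class Columns {
        public final List<Long> hierarchyOffsets = new ArrayList<>(); // one per row
        public final List<Long> hierarchyLengths = new ArrayList<>(); // one per row
        public final List<String> tags = new ArrayList<>();           // one per element
        public final List<Long> classesOffsets = new ArrayList<>();   // one per element
        public final List<Long> classesLengths = new ArrayList<>();   // one per element
        public final List<String> classes = new ArrayList<>();        // one per class
    }

    public static Columns flatten(List<List<DomElement>> rows) {
        Columns c = new Columns();
        for (List<DomElement> hierarchy : rows) {
            // The row's elements start at the current element count.
            c.hierarchyOffsets.add((long) c.tags.size());
            c.hierarchyLengths.add((long) hierarchy.size());
            for (DomElement e : hierarchy) {
                c.tags.add(e.tag); // null marks a missing tag
                // The element's classes start at the current class count.
                c.classesOffsets.add((long) c.classes.size());
                c.classesLengths.add((long) e.classes.size());
                c.classes.addAll(e.classes);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        List<List<DomElement>> rows = List.of(
            List.of(new DomElement("div", List.of("blue", "red")),
                    new DomElement("p", List.of())),
            List.of(new DomElement("nav", List.of("green")),
                    new DomElement(null, List.of("red"))));
        Columns c = flatten(rows);
        System.out.println(c.hierarchyOffsets); // prints [0, 2]
        System.out.println(c.hierarchyLengths); // prints [2, 2]
        System.out.println(c.tags);             // prints [div, p, nav, null]
        System.out.println(c.classesOffsets);   // prints [0, 2, 2, 3]
        System.out.println(c.classesLengths);   // prints [2, 0, 1, 1]
        System.out.println(c.classes);          // prints [blue, red, green, red]
    }
}
```

In this layout an element without classes gets a length of 0 instead of being dereferenced, which sidesteps the case where domElement.get("classes") returns null in the code above.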

0 answers:

No answers