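What is the correct way to write hierarchical data with optional and repeated fields to an ORC file?
The data is a list of JSON objects. Each object has an ID and a hierarchical path to a DOM element. Each DOM element has an optional tag and a list of classes: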
{id: 1, hierarchy: [
  {tag: "div", classes: ["blue", "red"]},
  {tag: "p"}
]}
{id: 1, hierarchy: [
  {tag: "nav", classes: ["green"]},
  {classes: ["red"]}
]}
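To make the optional tag concrete, here is a minimal sketch that flattens the sample into one entry per DOM element and records a flag for each missing tag, in the spirit of the isNull and noNulls fields on a column vector. The class OptionalTags and its helpers are hypothetical names used only for illustration; this is plain Java, not ORC API.

```java
import java.util.Arrays;

/** Hypothetical helper: models per-element null flags with plain arrays. */
class OptionalTags {

    /** Returns one flag per DOM element: true where the tag is absent. */
    static boolean[] nullFlags(String[] tags) {
        boolean[] isNull = new boolean[tags.length];
        for (int i = 0; i < tags.length; i++) {
            isNull[i] = tags[i] == null;
        }
        return isNull;
    }

    /** True only when no element is missing its tag. */
    static boolean noNulls(boolean[] isNull) {
        for (boolean flag : isNull) {
            if (flag) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The four DOM elements of the sample, in order; the last has no tag.
        String[] tags = {"div", "p", "nav", null};
        boolean[] isNull = nullFlags(tags);
        System.out.println(Arrays.toString(isNull)); // [false, false, false, true]
        System.out.println(noNulls(isNull));         // false
    }
}
```

Note that the flags are indexed by element position (0 to 3 across both objects), not by the index of the JSON object that contains the element.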
The schema in ORC (Java) is as follows:
TypeDescription rootSchema = createStruct();
rootSchema.addField("id", createInt());
TypeDescription domElementSchema =
    createStruct()
        .addField("tag", createString())
        .addField("classes", createList(createString()));
// First list because 1 hierarchy per JSON object.
// Second list because each hierarchy has multiple DOM elements.
rootSchema.addField("hierarchy", createList(createList(domElementSchema)));
I am confused about how to set the offsets and lengths of the ListColumnVector instances for the nested lists. My code so far is as follows:
private static void writeJsonToOrcFile(TypeDescription schema) {
    VectorizedRowBatch batch = schema.createRowBatchV2();
    var ids = (LongColumnVector) batch.cols[0];
    var hierarchies = (ListColumnVector) batch.cols[1];
    var domElems = (ListColumnVector) hierarchies.child;
    var domElemStructs = (StructColumnVector) domElems.child;
    var tags = (BytesColumnVector) domElemStructs.fields[0];
    // "classes" is the second field of the struct, so its index is 1.
    var classesList = (ListColumnVector) domElemStructs.fields[1];
    var classes = (BytesColumnVector) classesList.child;
    int hierarchiesOffset = 0;
    int classesOffset = 0;
    for (Map<String, JsonNode> jsonNode : JsonSource.forPath("my-data.json")) {
        // The current row.
        int row = batch.size++;
        // Write object ID
        ids.vector[row] = jsonNode.get("id").asLong();
        JsonNode hierarchy = jsonNode.get("hierarchy");
        hierarchies.offsets[row] = hierarchiesOffset;
        hierarchies.lengths[row] = hierarchy.size();
        // Associate offsets for DomElements
        for (int i = 0; i < hierarchy.size(); i++) {
            // Each item of the DomElements maps to a single struct.
            domElems.offsets[row + i] = row + i;
            domElems.lengths[row + i] = 1;
        }
        if (hierarchy.size() == 0) {
            domElems.offsets[row] = row;
            domElems.lengths[row] = 0;
            domElems.noNulls = false;
            domElems.isNull[row] = true;
        }
        // Write tags
        for (JsonNode domElement : hierarchy) {
            JsonNode tag = domElement.get("tag");
            if (tag == null) {
                tags.isNull[row] = true;
                tags.noNulls = false;
            } else {
                tags.setVal(row, tag.asText().getBytes());
            }
        }
        // Write classes
        int domElemIndex = 0;
        for (JsonNode domElement : hierarchy) {
            JsonNode classesNode = domElement.get("classes");
            classesList.offsets[row + domElemIndex] = classesOffset;
            classesList.lengths[row + domElemIndex] = classesNode.size();
            for (JsonNode classNode : classesNode) {
                String cssClass = classNode.asText();
                classes.setVal(row, cssClass.getBytes());
                classesOffset += 1;
            }
            hierarchiesOffset += 1;
            domElemIndex += 1;
        }
    }
}
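Is this approach correct?
hierarchies uses offsets into domElems based on the length of the hierarchy array.
domElems maps one-to-one to domElemStructs.
classesList uses the row plus domElementIndex as the offset into the classes vector.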
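For reference, the way I understand the offsets and lengths contract can be modelled with plain arrays: entry i of a list column owns the child positions from offsets[i] up to offsets[i] + lengths[i] - 1, and the child vector is addressed by its own running counter rather than by the row. The sketch below is only that model, applied to the sample data. The class NestedListOffsets and its ListLevel type are hypothetical and use no ORC classes; the field names merely mirror offsets, lengths, and childCount on ListColumnVector.

```java
import java.util.Arrays;

/** Hypothetical plain-array model of one list level; not part of ORC. */
class NestedListOffsets {

    /** Mirrors the offsets, lengths, and childCount fields of a list column. */
    static final class ListLevel {
        final long[] offsets;
        final long[] lengths;
        final int childCount;

        ListLevel(long[] offsets, long[] lengths, int childCount) {
            this.offsets = offsets;
            this.lengths = lengths;
            this.childCount = childCount;
        }
    }

    /** Entry i owns child positions offsets[i] .. offsets[i] + lengths[i] - 1. */
    static ListLevel build(int[] sizes) {
        long[] offsets = new long[sizes.length];
        long[] lengths = new long[sizes.length];
        int next = 0; // running position in the child vector
        for (int i = 0; i < sizes.length; i++) {
            offsets[i] = next;
            lengths[i] = sizes[i];
            next += sizes[i];
        }
        return new ListLevel(offsets, lengths, next);
    }

    public static void main(String[] args) {
        // Two rows; each hierarchy holds two DOM elements.
        ListLevel hierarchies = build(new int[] {2, 2});
        // Four DOM elements carrying 2, 0, 1, and 1 classes.
        ListLevel classesList = build(new int[] {2, 0, 1, 1});

        System.out.println(Arrays.toString(hierarchies.offsets)); // [0, 2]
        System.out.println(Arrays.toString(classesList.offsets)); // [0, 2, 2, 3]
        System.out.println(classesList.childCount);               // 4
    }
}
```

In this model each list level advances its own counter. The row index addresses only the top-level columns, the DOM element counter addresses tags and classesList, and the class counter addresses the classes vector. If that reading is right, indexing with row + i or writing classes at position row would overwrite earlier entries as soon as one row holds more than one element.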