How do I link an ORC file's ColumnStatistics to the column names defined in the schema (TypeDescription) using Java?
Reader reader = OrcFile.createReader(ignored);
TypeDescription schema = reader.getSchema();
ColumnStatistics[] stats = reader.getStatistics();
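The column statistics contain stats for every column type in a single flat array. The schema, however, is a tree of schemas. Are the column statistics a tree traversal (depth-first?) of the schema?
I tried using orc-statistics, but it only outputs column IDs.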
Answer 0: (score: 0)
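I figured out that the file statistics match a depth-first (pre-order) traversal of the schema. The traversal includes intermediate schemas that hold no data of their own, such as Struct and List. It also includes the overall schema as the first node. The ORC Specification v1 documentation explains this:
The type tree is flattened into a list via a pre-order traversal where each type is assigned the next id. Clearly the root of the type tree is always type id 0. Compound types have a field named subtypes that contains the list of their children's type ids.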
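To make the id assignment concrete, here is a minimal, self-contained sketch of a pre-order flattening. It does not use the ORC API: `PreOrderIds` and `Node` are hypothetical stand-ins for a schema tree, and the names in `main` are made up to mirror a schema like struct<a:int,b:struct<c:string,d:double>,e:array<int>>.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a schema tree; not part of the ORC API. */
final class PreOrderIds {
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name, Node... kids) {
      this.name = name;
      for (Node kid : kids) {
        children.add(kid);
      }
    }
  }

  /** Returns node names in pre-order; each name's list index is its id. */
  static List<String> flatten(Node root) {
    List<String> out = new ArrayList<>();
    visit(root, out);
    return out;
  }

  private static void visit(Node node, List<String> out) {
    out.add(node.name); // the parent takes the next id first
    for (Node child : node.children) {
      visit(child, out); // then each subtree, left to right
    }
  }

  public static void main(String[] args) {
    Node root = new Node("<ROOT>",
        new Node("a"),
        new Node("b", new Node("b::c"), new Node("b::d")),
        new Node("e", new Node("e::<INT>")));
    List<String> names = flatten(root);
    for (int id = 0; id < names.size(); id++) {
      System.out.println(id + " -> " + names.get(id));
    }
  }
}
```

The root lands at id 0, and the intermediate nodes `b` and `e` get ids of their own even though only their leaves hold data, which is why the statistics array is longer than the number of top-level fields.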
Full code to get a flattened list of schema names from an Orc TypeDescription:
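import static com.google.common.collect.ImmutableList.toImmutableList;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import org.apache.orc.TypeDescription;

final class OrcSchemas {
  private OrcSchemas() {}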
  /**
   * Returns all schema names in a depth-first traversal of schema.
   *
   * <p>The given schema is represented as '<ROOT>'. Intermediate, unnamed schemas like
   * StructColumnVector and ListColumnVector are represented using their category, like:
   * 'parent::<STRUCT>::field'.
   *
   * <p>This method is useful because some Orc file methods like statistics return all column stats
   * in a single flat array. The single flat array is a depth-first traversal of all columns in a
   * schema, including intermediate columns like structs and lists.
   */
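  static ImmutableList<String> flattenNames(TypeDescription schema) {
    // getChildren() returns null for primitive types, so use the null-safe helper.
    ImmutableList<TypeDescription> children = getChildren(schema);
    if (children.isEmpty()) {
      // Note: a childless root still has one statistics entry (id 0) in the file.
      return ImmutableList.of();
    }
    // The expected size is only a lower bound; nested types add more entries.
    ArrayList<String> names = Lists.newArrayListWithExpectedSize(children.size() + 1);
    names.add("<ROOT>");
    mutateAddNamesDfs("", schema, names);
    return ImmutableList.copyOf(names);
  }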
  private static void mutateAddNamesDfs(
      String parentName, TypeDescription schema, List<String> dfsNames) {
    String separator = "::";
    ImmutableList<String> schemaNames = getFieldNames(parentName, schema);
    ImmutableList<TypeDescription> children = getChildren(schema);
    for (int i = 0; i < children.size(); i++) {
      // Pre-order: record the child before descending into its subtree.
      String name = schemaNames.get(i);
      dfsNames.add(name);
      mutateAddNamesDfs(name + separator, children.get(i), dfsNames);
    }
  }
  private static ImmutableList<TypeDescription> getChildren(TypeDescription schema) {
    return Optional.ofNullable(schema.getChildren())
        .map(ImmutableList::copyOf)
        .orElse(ImmutableList.of());
  }
  private static ImmutableList<String> getFieldNames(String parentName, TypeDescription schema) {
    // Only structs carry field names; getFieldNames() does not handle other categories
    // (it can throw a NullPointerException), so check the category up front.
    if (schema.getCategory() == TypeDescription.Category.STRUCT) {
      return schema.getFieldNames().stream()
          .map(n -> parentName + n)
          .collect(toImmutableList());
    }
    // Other compound types (lists, maps, unions) have unnamed children, so label each
    // child by its category. Primitive types have no children and yield an empty list.
    return getChildren(schema).stream()
        .map(child -> parentName + "<" + child.getCategory() + ">")
        .collect(toImmutableList());
  }
}
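Once the names are flattened, pairing them with the statistics is a matter of matching by index. The following is a rough sketch of that step, not part of ORC: `StatsByName` and `zip` are hypothetical names, and the helper is generic so that it compiles without the ORC jars. It returns a list of pairs rather than a map because category-based names can repeat (for example, both children of a map<int,int> would be labeled `<INT>`).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; pairs flattened schema names with a flat stats array. */
final class StatsByName {
  private StatsByName() {}

  /**
   * Pairs names.get(i) with stats[i]. Both inputs are assumed to follow the same
   * pre-order traversal, so a length mismatch means they describe different schemas.
   */
  static <T> List<Map.Entry<String, T>> zip(List<String> names, T[] stats) {
    if (names.size() != stats.length) {
      throw new IllegalArgumentException(
          "Expected " + names.size() + " statistics but got " + stats.length);
    }
    List<Map.Entry<String, T>> pairs = new ArrayList<>(stats.length);
    for (int id = 0; id < stats.length; id++) {
      // The list index doubles as the ORC column id.
      pairs.add(new AbstractMap.SimpleImmutableEntry<>(names.get(id), stats[id]));
    }
    return pairs;
  }
}
```

With the reader from the question, this would be called as `StatsByName.zip(OrcSchemas.flattenNames(reader.getSchema()), reader.getStatistics())`, after which each entry's key is the column name and its value is the matching ColumnStatistics.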