Question

Here is a simple program that:

Writes records into an Orc file
Then tries to read them back using predicate pushdown (see read(..))

Questions:

Is this the right way to use predicate pushdown in Orc?
The read(..) method seems to return all records, completely ignoring the search argument. Why is that?

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.Reader.Options;
    import org.apache.orc.RecordReader;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;
    // The import for Type is not shown in the original post; Answer 2 refers to PredicateLeaf.Type.

    public class TestRoundTrip {
        public static void main(String[] args) throws IOException {
            final String file = "tmp/test-round-trip.orc";
            new File(file).delete();

            final long highestX = 10000L;
            final Configuration conf = new Configuration();

            write(file, highestX, conf);
            read(file, highestX, conf);
        }

        private static void read(String file, long highestX, Configuration conf) throws IOException {
            Reader reader = OrcFile.createReader(
                new Path(file),
                OrcFile.readerOptions(conf)
            );

            //Retrieve x that is "highestX - 1000". So, only 1 value should've been retrieved.
            Options readerOptions = new Options(conf)
                .searchArgument(
                    SearchArgumentFactory
                        .newBuilder()
                        .equals("x", Type.LONG, highestX - 1000)
                        .build(),
                    new String[]{"x"}
                );
            RecordReader rows = reader.rows(readerOptions);
            VectorizedRowBatch batch = reader.getSchema().createRowBatch();

            while (rows.nextBatch(batch)) {
                LongColumnVector x = (LongColumnVector) batch.cols[0];
                LongColumnVector y = (LongColumnVector) batch.cols[1];

                for (int r = 0; r < batch.size; r++) {
                    long xValue = x.vector[r];
                    long yValue = y.vector[r];
                    System.out.println(xValue + ", " + yValue);
                }
            }
            rows.close();
        }

        private static void write(String file, long highestX, Configuration conf) throws IOException {
            TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
            Writer writer = OrcFile.createWriter(
                new Path(file),
                OrcFile.writerOptions(conf).setSchema(schema)
            );

            VectorizedRowBatch batch = schema.createRowBatch();
            LongColumnVector x = (LongColumnVector) batch.cols[0];
            LongColumnVector y = (LongColumnVector) batch.cols[1];
            for (int r = 0; r < highestX; ++r) {
                int row = batch.size++;
                x.vector[row] = r;
                y.vector[row] = r * 3;
                // If the batch is full, write it out and start over.
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch);
                    batch.reset();
                }
            }
            // Flush any rows left in a partially filled batch.
            if (batch.size != 0) {
                writer.addRowBatch(batch);
                batch.reset();
            }
            writer.close();
        }
    }

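Notes:

I have not been able to find any useful unit tests that demonstrate how predicate pushdown works in Orc. Nor could I find any clear documentation on this feature. I tried looking at Orc on GitHub and the Spark code, but could not find anything useful.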


Answer 1

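I know this question is old, but the answer might be useful to someone. (I just saw that mac wrote a comment a few hours ago saying basically the same thing, but I think a separate answer is easier to find.)

Orc internally splits data into so-called "row groups" (10,000 rows each by default), and each row group has its own index. The search argument is only used to filter out row groups in which no row can match the search argument. It does not filter out individual rows. It may even happen that the index reports a row group as matching a search argument although not a single row in it actually matches. This is because the row group index consists mainly of the minimum and maximum values of each column in the row group.

So you will have to iterate over the returned rows and skip the ones that do not match your search criteria.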
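
The two stages described above can be illustrated with a small model in plain Java. This is only a sketch of the idea, not Orc's actual implementation or API: the class RowGroupPruningSketch, the RowGroup type, and the methods readWithPushdown and filterRows are all hypothetical names invented for this example, and the row groups hold 4 rows instead of 10,000.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPruningSketch {
    /** Hypothetical stand-in for one row group: its values plus min/max statistics. */
    static final class RowGroup {
        final long[] values;
        final long min;
        final long max;

        RowGroup(long[] values) {
            this.values = values;
            long lo = Long.MAX_VALUE;
            long hi = Long.MIN_VALUE;
            for (long v : values) {
                lo = Math.min(lo, v);
                hi = Math.max(hi, v);
            }
            this.min = lo;
            this.max = hi;
        }

        /** The statistics can only say "maybe": the target lies within [min, max]. */
        boolean mightContain(long target) {
            return min <= target && target <= max;
        }
    }

    /** Stage 1: skip whole row groups whose statistics rule out a match. */
    static List<Long> readWithPushdown(List<RowGroup> groups, long target) {
        List<Long> out = new ArrayList<>();
        for (RowGroup group : groups) {
            if (!group.mightContain(target)) {
                continue; // the entire group is skipped without reading its rows
            }
            for (long v : group.values) {
                out.add(v); // every row of a surviving group is returned
            }
        }
        return out;
    }

    /** Stage 2: the caller still has to drop the rows that do not match. */
    static List<Long> filterRows(List<Long> rows, long target) {
        List<Long> out = new ArrayList<>();
        for (long v : rows) {
            if (v == target) {
                out.add(v);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = new ArrayList<>();
        groups.add(new RowGroup(new long[] {0, 1, 2, 3}));
        groups.add(new RowGroup(new long[] {4, 6, 8, 10})); // min = 4, max = 10, no 7
        groups.add(new RowGroup(new long[] {11, 12, 13, 14}));

        List<Long> candidates = readWithPushdown(groups, 6);
        System.out.println(candidates);                // [4, 6, 8, 10]
        System.out.println(filterRows(candidates, 6)); // [6]

        // 7 lies between min and max of the second group, so the group is
        // returned even though none of its rows matches.
        System.out.println(readWithPushdown(groups, 7)); // [4, 6, 8, 10]
        System.out.println(filterRows(readWithPushdown(groups, 7), 7)); // []
    }
}
```

In the question's read(..) loop, stage 2 corresponds to checking xValue against highestX - 1000 before printing and skipping the row when the two differ. Since the file holds 10,000 rows, which is the default row group size mentioned above, it would be consistent with this model that nothing gets pruned and every record comes back.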

Answer 2

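I ran into the same problem, and I think it can be fixed by changing

.equals("x", Type.LONG,

to

.equals("x", PredicateLeaf.Type.LONG,

With this change, the reader seems to return only the batches that contain the relevant rows, rather than only the single row we asked for.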

Why is Apache Orc RecordReader.searchArgument() not filtering correctly?
