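Can you suggest a low-level columnar storage engine that can be integrated into a Java application?
Reason: we need a columnar storage engine with a Java API so that we can integrate it into our data processing application.
Background: the raw data comes from various CSV / TSV / WtfSV files of up to 10 GB each (less in most cases). The application has a predefined set of configurable operations to clean / transform / validate the data (similar to OpenRefine || DataWrangler || DataCleaner).
Problem: we currently use H2 MVStore with an Object[] value per row. Each row is evidently stored as a single entry, so whenever we need to process one column, the entire row is deserialized.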
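To make the problem concrete, here is a minimal in-memory sketch of the two layouts. It is only an illustration under simplifying assumptions: `ColumnLayoutSketch` and its methods are hypothetical names, nothing here is H2 MVStore API, and real storage would involve serialization rather than plain lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names belong to H2 MVStore.
final class ColumnLayoutSketch {

    // Row-oriented layout: each row is one entry, so extracting a single
    // column still visits (and, on disk, would deserialize) every whole row.
    static List<Object> readColumnFromRows(List<Object[]> rows, int columnIndex) {
        List<Object> out = new ArrayList<>(rows.size());
        for (Object[] row : rows) {
            out.add(row[columnIndex]);
        }
        return out;
    }

    // Column-oriented layout: one list per column, so a later scan of one
    // column never has to touch the values of the other columns.
    static List<List<Object>> toColumns(List<Object[]> rows, int columnCount) {
        List<List<Object>> columns = new ArrayList<>(columnCount);
        for (int c = 0; c < columnCount; c++) {
            columns.add(new ArrayList<>(rows.size()));
        }
        for (Object[] row : rows) {
            for (int c = 0; c < columnCount; c++) {
                columns.get(c).add(row[c]);
            }
        }
        return columns;
    }
}
```

With the second layout, processing "Col2" means reading only `columns.get(1)`, which is the access pattern we are after.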
Requirements:
An example of the API we need:
DataSet dataSet = Storage.dataSet("SomeName").withFilePath("C:\\data\\somename.dat").open(); // Open, or create if it does not exist
// class DataSet implements List<Column>
// class Column<T> implements List<T>
Column<String> col1 = dataSet.column("Col1").withType(String.class); // Column is created automatically if it does not exist (only once the user tries to add data)
Column<Integer> col2 = dataSet.column("Col2").withType(Integer.class);
// Load data into a column
for (String s : someStringList) {
    long idx = col1.add(s);
}
// Low priority
Column<SomeClass> col3 = dataSet.column("Col3").withCustomMapper(SomeClass.class,
    new Mapper<SomeClass>() {
        // byte[] could be InputStream, DataInputStream, etc.; it does not matter
        public SomeClass read(byte[] data) { /* some logic */ }
        public byte[] write(SomeClass data) { /* some logic */ }
    });
// Add an entire row. Type checking at runtime would be OK
long idx = dataSet.addRow(new Object[] {"123", 321, new SomeClass()});
// Get by index
SomeClass foo = dataSet.column("Col3", SomeClass.class).get(idx);
// Iterator from index
Iterator<String> it = col1.iterator(startIndex);
/*
Iterator with parallel prefetching.
This iterator dynamically adjusts its read-ahead buffer so that
a single-threaded consumer gets maximum performance,
e.g. there is no need to run multiple deserializing threads
if the consumer of the iterator is itself slow.
*/
PrefetchingIterator<DataRow> it3 = dataSet.iterator(startIndex).withMaxPerformanceParallelPrefetcher();
// Not a required feature, but it would be nice to have:
dataSet.createComputedColumn("ComputedCol", new Function<Long, SomeOtherClass>() {
    public SomeOtherClass apply(Long idx) {
        return new SomeOtherClass(col1.get(idx) + col2.get(idx).toString());
    }
}, isMaterialized); // The isMaterialized flag enables storing computed values for caching purposes
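To clarify what is meant by the prefetching iterator, here is a rough sketch built only on `java.util.concurrent`. It is a simplification, not the desired implementation: the class name `PrefetchingIteratorSketch` is hypothetical, the read-ahead buffer has a fixed size instead of adjusting dynamically, only one background thread is used, and `null` elements are not supported.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one daemon thread fills a bounded read-ahead buffer.
final class PrefetchingIteratorSketch<T> implements Iterator<T> {
    private static final Object END = new Object(); // sentinel: no more data
    private final BlockingQueue<Object> buffer;
    private volatile RuntimeException failure;      // error raised by the source, if any
    private Object next;                            // null means "nothing fetched yet"

    PrefetchingIteratorSketch(Iterator<T> source, int readAhead) {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(readAhead);
        this.buffer = queue;
        Thread producer = new Thread(() -> {
            try {
                try {
                    while (source.hasNext()) {
                        queue.put(source.next()); // blocks while the buffer is full
                    }
                } catch (RuntimeException e) {
                    failure = e;                  // hand the error over to the consumer
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take();             // blocks until an element or END arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        if (next == END && failure != null) {
            throw failure;
        }
        return next != END;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = (T) next;
        next = null;
        return value;
    }
}
```

Because `put` blocks when the buffer is full, a slow consumer naturally throttles the producer; the adaptive part (growing or shrinking the read-ahead, or adding deserializing threads only when they help) is what we would like the storage engine to provide.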
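AFAIK, there are full-blown big data solutions such as Apache Spark / Flink that support some of the listed features internally, and even pure storage engines such as Apache ORC and Apache Parquet, but I have had no success googling for the required API. If any of these solutions meets our needs, please give me a link to the corresponding API or example page.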