How do you read an ORC file in Java? I want to read a small file for some unit test output verification, but I can't find a solution.
Answer 0 (score: 11)
Ran into this and implemented one myself recently:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

import java.util.List;

public class OrcFileDirectReaderExample {
    public static void main(String[] argv) {
        try {
            // Obtain the FileSystem (HDFS or local) from the Hadoop configuration
            FileSystem fs = FileSystem.get(new Configuration());
            Reader reader = OrcFile.createReader(fs, new Path("/user/hadoop/000000_0"));
            StructObjectInspector inspector = (StructObjectInspector) reader.getObjectInspector();
            System.out.println(reader.getMetadata());
            RecordReader records = reader.rows();
            Object row = null;

            // These objects are the metadata for each column. They give you the type
            // of each column and can parse it, unless you want to parse each column yourself.
            List<? extends StructField> fields = inspector.getAllStructFieldRefs();
            for (StructField field : fields) {
                System.out.print(field.getFieldObjectInspector().getTypeName() + '\t');
            }

            while (records.hasNext()) {
                row = records.next(row);
                List<Object> valueList = inspector.getStructFieldsDataAsList(row);
                StringBuilder builder = new StringBuilder();
                // Iterate over the fields. A field can be null if a null was written
                // for that column when this file was produced.
                for (Object field : valueList) {
                    if (field != null) {
                        builder.append(field.toString());
                    }
                    builder.append('\t');
                }
                // This prints the row as it would appear in a tab-separated text file.
                System.out.println(builder.toString());
            }
            records.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
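If you are not tied to the Hive classes, the standalone ORC library (org.apache.orc, ORC 1.x and later) reads files in vectorized batches instead of row by row. A minimal sketch under the assumption that orc-core is on the classpath; the file path and the long-typed first column are placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcCoreReaderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "/tmp/test.orc" is a placeholder path
        Reader reader = OrcFile.createReader(new Path("/tmp/test.orc"),
                OrcFile.readerOptions(conf));
        System.out.println("Schema: " + reader.getSchema());

        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        while (rows.nextBatch(batch)) {
            for (int r = 0; r < batch.size; ++r) {
                // Column vectors are typed; this assumes the first column is a long.
                LongColumnVector col0 = (LongColumnVector) batch.cols[0];
                if (!col0.isNull[r]) {
                    System.out.println(col0.vector[r]);
                }
            }
        }
        rows.close();
    }
}

The batch API avoids per-row object-inspector overhead, which matters for large files but is also perfectly usable for small test fixtures.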
Answer 1 (score: 0)
According to the Apache wiki, the ORC file format was introduced in Hive 0.11.
So you need the Hive packages on your project's classpath to read ORC files. The relevant classes are
org.apache.hadoop.hive.ql.io.orc.Reader
org.apache.hadoop.hive.ql.io.orc.OrcFile
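Both classes ship in the hive-exec artifact. A minimal Maven dependency sketch; the version shown is illustrative and should match your cluster's Hive version:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <!-- illustrative version; use the one matching your Hive installation -->
    <version>1.2.1</version>
</dependency>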
Answer 2 (score: -1)
Try this for an ORC file row count...
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.StripeInformation;

// Sums the row counts recorded in each stripe's metadata;
// no row data is actually read.
private long getRowCount(FileSystem fs, String fName) throws Exception {
    long tempCount = 0;
    Reader rdr = OrcFile.createReader(fs, new Path(fName));
    for (StripeInformation stripe : rdr.getStripes()) {
        tempCount += stripe.getNumberOfRows();
    }
    return tempCount;
}

// fName is the HDFS path to the file.
long rowCount = getRowCount(fs, fName);
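Note that the reader also exposes the total from the file footer directly, so the stripe loop can be skipped entirely. A minimal sketch using the same Reader class as above:

Reader rdr = OrcFile.createReader(fs, new Path(fName));
long rowCount = rdr.getNumberOfRows(); // total row count from the file footer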