I am trying to read a small RCFile (about 200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.
Here is what I have so far, most of which was lifted from this example:
public void configure(JobConf job)
{
    try
    {
        FileSystem fs = FileSystem.get(job);
        RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);
        int counter = 1;
        while (rcFileReader.next(new LongWritable(counter)))
        {
            System.out.println("Fetching data for row " + counter);
            BytesRefArrayWritable dataRead = new BytesRefArrayWritable();
            rcFileReader.getCurrentRow(dataRead);
            System.out.println("dataRead: " + dataRead + " dataRead.size(): " + dataRead.size());
            for (int i = 0; i < dataRead.size(); i++)
            {
                BytesRefWritable bytesRefRead = dataRead.get(i);
                byte b1[] = bytesRefRead.getData();
                Text returnData = new Text(b1);
                System.out.println("READ-DATA = " + returnData.toString());
            }
            counter++;
        }
    }
    catch (IOException e)
    {
        throw new Error(e);
    }
}
However, the output I get has all of the data for each column concatenated together in the first row, and no data at all in any of the other rows:
Fetching data for row 1
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@7f26d3df dataRead.size(): 5
READ-DATA = 191606656066860670
READ-DATA = United StatesAmerican SamoaGuamNorthern Mariana Islands
READ-DATA = USASGUMP
READ-DATA = USSouth PacificSouth PacificSouth Pacific
READ-DATA = 19888
Fetching data for row 2
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@1cb1a4e2 dataRead.size(): 0
Fetching data for row 3
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@52c00025 dataRead.size(): 0
Fetching data for row 4
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@3b49a794 dataRead.size(): 0
How can I read this data properly, so that I have access to one row at a time, e.g. (191, United States, US, US, 19)?
Answer 0 (score 0):
After some digging, I found the solution. The key here is not to use RCFile.Reader, but RCFileRecordReader.
Here is what I ended up with, which also works for opening multiple files:
try
{
    FileSystem fs = FileSystem.get(job);
    FileStatus[] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));
    LongWritable key = new LongWritable();
    BytesRefArrayWritable value = new BytesRefArrayWritable();
    int counter = 1;
    for (int i = 0; i < fileStatuses.length; i++)
    {
        FileStatus fileStatus = fileStatuses[i];
        if (!fileStatus.isDir())
        {
            System.out.println("File: " + fileStatus);
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);
            RCFileRecordReader reader = new RCFileRecordReader(job, split);
            while (reader.next(key, value))
            {
                System.out.println("Getting row " + counter);
                AllCountriesRow acr = AllCountriesRow.valueOf(value);
                System.out.println("ROW: " + acr);
                counter++;
            }
        }
    }
}
catch (IOException e)
{
    throw new Error(e);
}
And AllCountriesRow.valueOf (note that Column is an enum of the columns in the order they appear in each row, and serDe is an instance of ColumnarSerDe):
public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException
{
    try
    {
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();
        Object row = serDe.deserialize(braw);
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();

        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();
        int id = ((IntObjectInspector) oi).get(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();
        String name = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();
        String regionName = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();
        int continentId = ((IntObjectInspector) oi).get(fieldData);

        return new AllCountriesRow(id, name, code, regionName, continentId);
    }
    catch (SerDeException e)
    {
        throw new IOException(e);
    }
}
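For reference, here is a minimal sketch of how the Column enum and the static serDe field used above could be declared. The field names and types here are assumptions inferred from the fields read in valueOf; the serDe initialization follows the same pattern as the second answer below.

// Assumed supporting declarations (illustrative, not from the original post).
// Column fixes the order used by the ordinal() calls in valueOf, and serDe is
// initialized with matching column names/types, as in the answer below.
private enum Column { ID, NAME, CODE, REGION_NAME, CONTINENT_ID }

private static final ColumnarSerDe serDe;
static
{
    try
    {
        Properties tbl = new Properties();
        tbl.setProperty("columns", "id,name,code,region_name,continent_id");
        tbl.setProperty("columns.types", "int:string:string:string:int");
        serDe = new ColumnarSerDe();
        serDe.initialize(new Configuration(), tbl);
    }
    catch (SerDeException e)
    {
        throw new Error(e);
    }
}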
This ends up with an AllCountriesRow object that contains all of the information for the row in question.
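To tie this back to the original map-side join goal, the rows can then be cached in a HashMap inside configure(). This is only a sketch; getId() is an assumed accessor on AllCountriesRow, not shown above:

// Hypothetical: collect each row into a map keyed by id for the map-side join.
Map<Integer, AllCountriesRow> countriesById = new HashMap<Integer, AllCountriesRow>();
while (reader.next(key, value))
{
    AllCountriesRow acr = AllCountriesRow.valueOf(value);
    countriesById.put(acr.getId(), acr); // getId() is an assumed getter
}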
Answer 1 (score 0):
Due to the columnar nature of RCFiles, the row-wise read path is significantly different from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). But in addition, we need to use a ColumnarSerDe to convert the columnar data into row-wise data.
The following is the most simplified code we could get for reading an RCFile row-wise. Please refer to the inline code comments for more details.
private static void readRCFileByRow(String pathStr)
    throws IOException, SerDeException {
  final Configuration conf = new Configuration();
  final Properties tbl = new Properties();
  /*
   * Set the column names and types using comma-separated strings.
   * The actual names of the columns are not important, as long as the
   * column count is correct.
   *
   * For types, this example uses strings. byte[] can be stored as a string
   * by encoding the bytes to ASCII (such as hexString or Base64).
   *
   * The number of columns and the number of types must match exactly.
   */
  tbl.setProperty("columns", "col1,col2,col3,col4,col5");
  tbl.setProperty("columns.types", "string:string:string:string:string");
  /*
   * We need a ColumnarSerDe to deserialize the columnar data to row-wise
   * data.
   */
  ColumnarSerDe serDe = new ColumnarSerDe();
  serDe.initialize(conf, tbl);
  Path path = new Path(pathStr);
  FileSystem fs = FileSystem.get(conf);
  final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);
  final LongWritable key = new LongWritable();
  final BytesRefArrayWritable cols = new BytesRefArrayWritable();
  while (reader.next(key)) {
    System.out.println("Getting next row.");
    /*
     * IMPORTANT: Pass the same cols object to the getCurrentRow API; do not
     * create a new BytesRefArrayWritable() each time. This is because one
     * call to getCurrentRow(cols) can potentially read more than one
     * column's values, which the serde below takes care of reading one by
     * one.
     */
    reader.getCurrentRow(cols);
    final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
    final ArrayList<Object> objects = row.getFieldsAsList();
    for (final Object object : objects) {
      // Lazy decompression happens here
      final String payload =
          ((LazyString) object).getWritableObject().toString();
      System.out.println("Value:" + payload);
    }
  }
}
In this code, getCurrentRow() still reads the data column by column, and we need the SerDe to convert it into a row. Also, calling getCurrentRow() does not mean that all of the fields in the row have been decompressed. Due to lazy decompression, a column is not actually decompressed until one of its fields is deserialized. For this, we use columnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual reading happens in the getWritableObject() call on the LazyString reference.
Another way to achieve the same thing is to use a StructObjectInspector with the copyToStandardObject API, but I find the method above simpler.