I am trying to read a small RCFile (about 200 rows of data) into a HashMap to do a map-side join, but I am having a lot of trouble getting the data in the file into a usable state.
Here is what I have so far, most of which was lifted from this example:
public void configure(JobConf job)
{
    try
    {
        FileSystem fs = FileSystem.get(job);
        RCFile.Reader rcFileReader = new RCFile.Reader(fs, new Path("/path/to/file"), job);
        int counter = 1;
        while (rcFileReader.next(new LongWritable(counter)))
        {
            System.out.println("Fetching data for row " + counter);
            BytesRefArrayWritable dataRead = new BytesRefArrayWritable();
            rcFileReader.getCurrentRow(dataRead);
            System.out.println("dataRead: " + dataRead + " dataRead.size(): " + dataRead.size());
            for (int i = 0; i < dataRead.size(); i++)
            {
                BytesRefWritable bytesRefRead = dataRead.get(i);
                byte b1[] = bytesRefRead.getData();
                Text returnData = new Text(b1);
                System.out.println("READ-DATA = " + returnData.toString());
            }
            counter++;
        }
    }
    catch (IOException e)
    {
        throw new Error(e);
    }
}
However, the output I get has all of the data for each column concatenated together in the first row, and no data at all in any of the other rows:
Fetching data for row 1
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@7f26d3df dataRead.size(): 5
READ-DATA = 191606656066860670
READ-DATA = United StatesAmerican SamoaGuamNorthern Mariana Islands
READ-DATA = USASGUMP
READ-DATA = USSouth PacificSouth PacificSouth Pacific
READ-DATA = 19888
Fetching data for row 2
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@1cb1a4e2 dataRead.size(): 0
Fetching data for row 3
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@52c00025 dataRead.size(): 0
Fetching data for row 4
dataRead: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@3b49a794 dataRead.size(): 0
How can I read this data properly, so that I have access to one row at a time, e.g. (191, United States, US, US, 19)?
Answer 0 (score 0):
After some digging, I found the solution. The key here is not to use RCFile.Reader, but RCFileRecordReader.
Here is what I ended up with, which also works for opening multiple files:
try
{
    FileSystem fs = FileSystem.get(job);
    FileStatus[] fileStatuses = fs.listStatus(new Path("/path/to/dir/"));
    LongWritable key = new LongWritable();
    BytesRefArrayWritable value = new BytesRefArrayWritable();
    int counter = 1;
    for (int i = 0; i < fileStatuses.length; i++)
    {
        FileStatus fileStatus = fileStatuses[i];
        if (!fileStatus.isDir())
        {
            System.out.println("File: " + fileStatus);
            FileSplit split = new FileSplit(fileStatus.getPath(), 0, fileStatus.getLen(), job);
            RCFileRecordReader reader = new RCFileRecordReader(job, split);
            while (reader.next(key, value))
            {
                System.out.println("Getting row " + counter);
                AllCountriesRow acr = AllCountriesRow.valueOf(value);
                System.out.println("ROW: " + acr);
                counter++;
            }
        }
    }
}
catch (IOException e)
{
    throw new Error(e);
}
And AllCountriesRow.valueOf (note that Column is an enum of the columns in the order they appear in each row, and serDe is an instance of ColumnarSerDe):
public static AllCountriesRow valueOf(BytesRefArrayWritable braw) throws IOException
{
    try
    {
        StructObjectInspector soi = (StructObjectInspector) serDe.getObjectInspector();
        Object row = serDe.deserialize(braw);
        List<? extends StructField> fieldRefs = soi.getAllStructFieldRefs();

        Object fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.ID.ordinal()));
        ObjectInspector oi = fieldRefs.get(Column.ID.ordinal()).getFieldObjectInspector();
        int id = ((IntObjectInspector) oi).get(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.NAME.ordinal()));
        oi = fieldRefs.get(Column.NAME.ordinal()).getFieldObjectInspector();
        String name = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CODE.ordinal()));
        oi = fieldRefs.get(Column.CODE.ordinal()).getFieldObjectInspector();
        String code = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.REGION_NAME.ordinal()));
        oi = fieldRefs.get(Column.REGION_NAME.ordinal()).getFieldObjectInspector();
        String regionName = ((StringObjectInspector) oi).getPrimitiveJavaObject(fieldData);

        fieldData = soi.getStructFieldData(row, fieldRefs.get(Column.CONTINENT_ID.ordinal()));
        oi = fieldRefs.get(Column.CONTINENT_ID.ordinal()).getFieldObjectInspector();
        int continentId = ((IntObjectInspector) oi).get(fieldData);

        return new AllCountriesRow(id, name, code, regionName, continentId);
    }
    catch (SerDeException e)
    {
        throw new IOException(e);
    }
}
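For reference, here is a minimal sketch of how the Column enum and the static serDe field used above could be declared. The field names and types here are assumptions inferred from the fields read in valueOf; the serDe initialization follows the same pattern as the second answer below.

// Assumed supporting declarations (illustrative, not from the original post).
// Column fixes the order used by the ordinal() calls in valueOf, and serDe is
// initialized with matching column names/types, as in the answer below.
private enum Column { ID, NAME, CODE, REGION_NAME, CONTINENT_ID }

private static final ColumnarSerDe serDe;
static
{
    try
    {
        Properties tbl = new Properties();
        tbl.setProperty("columns", "id,name,code,region_name,continent_id");
        tbl.setProperty("columns.types", "int:string:string:string:int");
        serDe = new ColumnarSerDe();
        serDe.initialize(new Configuration(), tbl);
    }
    catch (SerDeException e)
    {
        throw new Error(e);
    }
}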
This ends up with an AllCountriesRow object that contains all of the information for the row in question.
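To tie this back to the original map-side join goal, the rows can then be cached in a HashMap inside configure(). This is only a sketch; getId() is an assumed accessor on AllCountriesRow, not shown above:

// Hypothetical: collect each row into a map keyed by id for the map-side join.
Map<Integer, AllCountriesRow> countriesById = new HashMap<Integer, AllCountriesRow>();
while (reader.next(key, value))
{
    AllCountriesRow acr = AllCountriesRow.valueOf(value);
    countriesById.put(acr.getId(), acr); // getId() is an assumed getter
}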
Answer 1 (score 0):
Due to the columnar nature of RCFiles, the row-wise read path is significantly different from the write path. We can still use the RCFile.Reader class to read an RCFile row by row (RCFileRecordReader is not needed). But in addition, we need to use a ColumnarSerDe to convert the columnar data into row-wise data.
The following is the most simplified code we could get for reading an RCFile row-wise. Please refer to the inline code comments for more details.
private static void readRCFileByRow(String pathStr)
    throws IOException, SerDeException {
  final Configuration conf = new Configuration();
  final Properties tbl = new Properties();
  /*
   * Set the column names and types using comma-separated strings.
   * The actual names of the columns are not important, as long as the
   * column count is correct.
   *
   * For types, this example uses strings. byte[] can be stored as a string
   * by encoding the bytes to ASCII (such as hexString or Base64).
   *
   * The number of columns and the number of types must match exactly.
   */
  tbl.setProperty("columns", "col1,col2,col3,col4,col5");
  tbl.setProperty("columns.types", "string:string:string:string:string");
  /*
   * We need a ColumnarSerDe to deserialize the columnar data to row-wise
   * data.
   */
  ColumnarSerDe serDe = new ColumnarSerDe();
  serDe.initialize(conf, tbl);
  Path path = new Path(pathStr);
  FileSystem fs = FileSystem.get(conf);
  final RCFile.Reader reader = new RCFile.Reader(fs, path, conf);
  final LongWritable key = new LongWritable();
  final BytesRefArrayWritable cols = new BytesRefArrayWritable();
  while (reader.next(key)) {
    System.out.println("Getting next row.");
    /*
     * IMPORTANT: Pass the same cols object to the getCurrentRow API; do not
     * create a new BytesRefArrayWritable() each time. This is because one
     * call to getCurrentRow(cols) can potentially read more than one
     * column's values, which the serde below takes care of reading one by
     * one.
     */
    reader.getCurrentRow(cols);
    final ColumnarStruct row = (ColumnarStruct) serDe.deserialize(cols);
    final ArrayList<Object> objects = row.getFieldsAsList();
    for (final Object object : objects) {
      // Lazy decompression happens here
      final String payload =
          ((LazyString) object).getWritableObject().toString();
      System.out.println("Value:" + payload);
    }
  }
}
In this code, getCurrentRow() still reads the data column by column, and we need the SerDe to convert it into a row. Also, calling getCurrentRow() does not mean that all of the fields in the row have been decompressed. Due to lazy decompression, a column is not actually decompressed until one of its fields is deserialized. For this, we use columnarStruct.getFieldsAsList() to get a list of references to the lazy objects. The actual reading happens in the getWritableObject() call on the LazyString reference.
Another way to achieve the same thing is to use a StructObjectInspector with the copyToStandardObject API, but I find the method above simpler.