How is a Record defined in Hadoop MapReduce for different types of datasets?

Time: 2017-09-21 17:22:03

Tags: hadoop mapreduce hadoop2

I would like to understand how a Record is defined in Hadoop MapReduce for data types other than Text.

Normally, for Text data, a record is a whole line terminated by a newline.
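For example, with the default TextInputFormat each such line reaches the mapper as one key/value pair: the key is the line's byte offset in the file and the value is the line itself. Below is a minimal sketch of such a mapper; the word-splitting body is only an illustrative assumption, not part of any particular job.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat, one "record" = one line of the input file.
    // Key   = byte offset of the line within the file (LongWritable)
    // Value = the content of the line (Text)
    public class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Each call to map() processes exactly one record (one line).
            for (String token : line.toString().split("\\s+")) {
                context.write(new Text(token), ONE);
            }
        }
    }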

Now, if we are processing XML data, how is that data handled? In other words, how is the Record that a mapper works on defined?

I have read about the concepts of InputFormat and RecordReader, but I don't understand them well.

Can anyone help me understand the relationship between InputFormat and RecordReader for various kinds of datasets (other than text), and how the data gets converted into the Records that a mapper processes?

1 answer:

Answer 0: (score: 0)

Let's start with some basic concepts.

    From the perspective of a file:
    1. A file is a collection of rows.
    2. A row is a collection of one or more columns, separated by a delimiter.
    3. A file can be of any format: text file, Parquet file, ORC file.

    Different file formats store rows (and their columns) in different ways, and the choice of delimiter also differs.


From the perspective of HDFS:
     1. A file is a sequence of bytes.
     2. HDFS has no idea of the logical structure of the file, i.e. rows and columns.
     3. HDFS does not guarantee that a row will be contained within one HDFS block; a row can span two blocks (see the example below).
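A quick way to see that HDFS only tracks byte ranges is to list a file's block locations: each block is just an offset and a length, with no notion of rows. This is a minimal sketch; the path /data/input.txt and the class name are placeholders, not anything specific to the question.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockBoundaries {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder path -- replace with a real file on your cluster.
            Path file = new Path("/data/input.txt");
            FileStatus status = fs.getFileStatus(file);

            // Each BlockLocation is just a byte range [offset, offset + length);
            // HDFS does not know (or care) whether a row straddles two of them.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }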


    Input Format: the code that knows how to read file chunks from splits, and at the same time ensures that if a row extends into the next split, it is still treated as part of the first split.

    Record Reader: as you read a split, some code (the Record Reader) has to understand how to interpret a row from the bytes read from HDFS (see the sketch below).
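To make the relationship concrete, here is a hedged, minimal skeleton of a custom InputFormat/RecordReader pair. The class names and the upper-casing transformation are illustrative assumptions; for simplicity it reuses Hadoop's LineRecordReader, which is exactly what the built-in TextInputFormat does. Any other format (XML included) follows the same contract, only the record-boundary logic changes.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Illustrative InputFormat: FileInputFormat supplies getSplits() (the byte
    // ranges), and createRecordReader() supplies the code that turns those
    // bytes into records for the mapper.
    public class UpperCaseLineInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                   TaskAttemptContext context) {
            return new UpperCaseLineRecordReader();
        }

        // The RecordReader contract: initialize() receives one split (a byte range),
        // nextKeyValue() advances to the next record, and getCurrentKey()/
        // getCurrentValue() expose that record to the mapper. LineRecordReader
        // already handles the tricky split-boundary part: it skips a partial first
        // line (the previous split owns it) and reads past the end of its split to
        // finish the last line.
        public static class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {

            private final LineRecordReader delegate = new LineRecordReader();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context)
                    throws IOException, InterruptedException {
                delegate.initialize(split, context);
            }

            @Override
            public boolean nextKeyValue() throws IOException, InterruptedException {
                if (!delegate.nextKeyValue()) {
                    return false;
                }
                // Example transformation: upper-case the line before handing it to the mapper.
                value.set(delegate.getCurrentValue().toString().toUpperCase());
                return true;
            }

            @Override
            public LongWritable getCurrentKey() throws IOException, InterruptedException {
                return delegate.getCurrentKey();
            }

            @Override
            public Text getCurrentValue() {
                return value;
            }

            @Override
            public float getProgress() throws IOException, InterruptedException {
                return delegate.getProgress();
            }

            @Override
            public void close() throws IOException {
                delegate.close();
            }
        }
    }

A job would then select it with job.setInputFormatClass(UpperCaseLineInputFormat.class). For a format Hadoop does not ship (XML, for example), the same skeleton applies, but nextKeyValue() has to scan the raw bytes for the record delimiter (such as a closing tag) itself instead of delegating to LineRecordReader.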

For more information:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/