I would like to understand how a Record is defined in Hadoop MapReduce for data types other than Text. Typically, for Text data, a record is a whole line terminated by a newline.

Now, if we want to process XML data, how is that data handled? In other words, how is the Record on which the mapper works defined?

I have read that there are the concepts of InputFormat and RecordReader, but I did not grasp them well. Can anyone help me understand the relationship between InputFormat, RecordReader and the various kinds of datasets (other than text), and how the data gets converted into the Records that the mapper processes?
Answer 0 (score: 0)
Let's start with some basic concepts.

From the perspective of a file:
1. A file is a collection of rows.
2. A row is a collection of one or more columns, separated by a delimiter.
3. A file can be in any format: text file, Parquet file, ORC file, etc.
Different file formats store rows (columns) in different ways, and the choice of delimiter also differs.
From the perspective of HDFS:
1. A file is a sequence of bytes.
2. HDFS has no idea of the logical structure of the file, i.e. rows and columns.
3. HDFS does not guarantee that a row is contained within one HDFS block; a row can span two blocks.
Input Format: the code that knows how to read the file chunks from the splits, and at the same time ensures that if a row extends into the next split, it is still treated as part of the first split.
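As a concrete sketch (not part of the original answer), this is roughly how an XML-oriented InputFormat could look with the new org.apache.hadoop.mapreduce API. The names XmlInputFormat and XmlRecordReader are assumptions, loosely modeled on the widely used "read between a start tag and an end tag" technique:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical InputFormat for XML: it reuses FileInputFormat's split computation
// (splits roughly follow HDFS block boundaries) and only plugs in a RecordReader
// that knows where one XML record ends and the next one begins.
public class XmlInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new XmlRecordReader(); // sketched under "Record Reader" below
    }
}

A driver would then select it with job.setInputFormatClass(XmlInputFormat.class), in the same way a plain-text job uses TextInputFormat.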
Record Reader: as you read a split, some code (the RecordReader) has to understand how to interpret a row from the bytes read from HDFS.
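Below is a minimal sketch of the matching RecordReader, assuming each record is the byte range between a configurable start tag and end tag (the property names xmlinput.start and xmlinput.end are assumptions, not a standard Hadoop API). It also illustrates the point above: the reader only starts a record inside its own split, but may keep reading past the split boundary to finish it, so a record spanning two blocks still comes out whole.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical RecordReader: one record = the bytes from a start tag to the matching end tag.
public class XmlRecordReader extends RecordReader<LongWritable, Text> {
    private byte[] startTag;
    private byte[] endTag;
    private long start;
    private long end;
    private FSDataInputStream fsin;
    private final DataOutputBuffer buffer = new DataOutputBuffer();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Assumed property names; the job tells the reader which element delimits a record.
        startTag = conf.get("xmlinput.start", "<record>").getBytes("UTF-8");
        endTag = conf.get("xmlinput.end", "</record>").getBytes("UTF-8");

        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path file = fileSplit.getPath();
        fsin = file.getFileSystem(conf).open(file);
        fsin.seek(start);                       // position the byte stream at this split
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Only *start* a record inside this split; finishing it may read past 'end',
        // which is how a record spanning two blocks is still returned in one piece.
        if (fsin.getPos() < end && readUntilMatch(startTag, false)) {
            buffer.write(startTag);
            if (readUntilMatch(endTag, true)) {
                key.set(fsin.getPos());
                value.set(buffer.getData(), 0, buffer.getLength());
                buffer.reset();
                return true;
            }
        }
        return false;
    }

    // Scan forward byte by byte until 'match' is seen; optionally copy bytes into the buffer.
    private boolean readUntilMatch(byte[] match, boolean insideRecord) throws IOException {
        int i = 0;
        while (true) {
            int b = fsin.read();
            if (b == -1) return false;                       // end of file
            if (insideRecord) buffer.write(b);
            if (b == match[i]) {
                i++;
                if (i >= match.length) return true;          // full tag seen
            } else {
                i = 0;
            }
            // Past the split boundary and not inside a record: stop looking for new records.
            if (!insideRecord && i == 0 && fsin.getPos() >= end) return false;
        }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException {
        return end == start ? 0.0f : (fsin.getPos() - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { fsin.close(); }
}

The mapper then simply receives each complete XML fragment as its value, without ever having to worry about split or block boundaries.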
For more information:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/