Hadoop breaks the input data into blocks without regard to its content. As the post describes:
HDFS has no idea (and does not care) what is stored inside the file, so the raw file is not split according to rules that we humans would understand. Humans, for example, would want record boundaries (the lines showing where a record begins and ends) to be respected.
The part that is unclear to me is this: if the data is split purely by size, without regard to its content, won't the accuracy of the queries executed later suffer? Take the frequently cited example of a list of cities and their daily temperatures. A city could end up in one block and its temperature somewhere else, so how does a map operation query that information correctly? There seems to be something fundamental about blocks and MR queries that I am missing. Any help would be appreciated.
Answer 0 (score: 0)
A city could end up in one block and its temperature somewhere else
That is entirely possible, yes. In that case the record boundary spans two blocks, and both blocks are read.
No accuracy is lost, but performance is, in terms of disk and network IO. When the end of a block is reached before the end of the InputSplit, the next block is read. Even if the remainder of the split amounts to only the first few bytes of that following block, its byte stream still has to be processed.
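To make that concrete, here is a minimal self-contained sketch (plain Java, not Hadoop's actual LineRecordReader; the sample data and the block boundary at byte 16 are invented) of a reader that only starts records inside its split but finishes the last record by reading past the split's end:

import java.nio.charset.StandardCharsets;

public class SplitReadDemo {
    public static void main(String[] args) {
        // Toy data with an imaginary block boundary at byte 16; "Berlin,21" straddles it.
        byte[] data = "London,18\nBerlin,21\nParis,19\n".getBytes(StandardCharsets.UTF_8);
        int splitStart = 0, splitEnd = 16;   // this split covers only the first "block"

        int pos = splitStart;
        while (pos < splitEnd) {             // start new records only inside the split...
            int nl = pos;
            while (nl < data.length && data[nl] != '\n') nl++;   // ...but finish them past its end
            System.out.println(new String(data, pos, nl - pos, StandardCharsets.UTF_8));
            pos = nl + 1;
        }
        // Prints "London,18" and "Berlin,21": the second record is completed with bytes
        // that physically belong to the next block (hence the extra disk/network IO).
    }
}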
Answer 1 (score: 0)
Let's get into the basics of an ext filesystem (forget HDFS for the time being).
1. On your hard disk, data is stored in the form of tracks and sectors. When a file is stored, there is no guarantee that a complete record lands in a single block (4 KB); it can span blocks.
2. The process that reads the file reads block by block and finds the record boundaries; a record is a logical entity.
3. The file saved to the hard disk as bytes has no understanding of records or file formats. File formats and records are logical entities.
Apply the same logic to HDFS.
1. The block size is 128 MB.
2. Just like an ext filesystem, HDFS has no clue about record boundaries.
3. What the mappers do is find the record boundaries logically (a minimal sketch of these rules follows after this list):
a. The mapper that reads from file offset 0 starts reading at the beginning of the file, until it finds \n.
b. Every mapper that does not read the file from offset 0 skips bytes until it reaches a \n and only then continues reading. The sequence of bytes up to that newline is omitted; it may be a complete record or a partial record, and it is consumed by another mapper.
c. Mappers read the block they are assigned and keep reading until they find a \n, which may sit in another block rather than in the block that is local to them.
d. So except for the first mapper, every mapper reads its local block plus the byte sequence from the next block up to the first \n.
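As a minimal sketch of rules (b) to (d) (plain Java with invented sample data, not Hadoop's real record reader):

import java.nio.charset.StandardCharsets;

public class SkipFirstLineDemo {
    public static void main(String[] args) {
        byte[] data = "London,18\nBerlin,21\nParis,19\n".getBytes(StandardCharsets.UTF_8);
        int splitStart = 16, splitEnd = data.length;   // the second split starts mid-record

        int pos = splitStart;
        if (splitStart != 0) {                         // rule (b): not the first mapper,
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;                                     // so skip up to and including the first \n;
        }                                              // those bytes belong to the previous mapper
        while (pos < splitEnd) {                       // rules (c)/(d): read whole lines from here on
            int nl = pos;
            while (nl < data.length && data[nl] != '\n') nl++;
            System.out.println(new String(data, pos, nl - pos, StandardCharsets.UTF_8));
            pos = nl + 1;
        }
        // Prints only "Paris,19"; the partial "Berlin,21" was already handled by the first mapper.
    }
}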
Answer 2 (score: 0)
Hi Ali,
How a file is split into data blocks is decided by the Hadoop HDFS gateway at store time; it depends on whether you are on Hadoop 1.x or 2.x and on the size of the file you put from your local machine onto the gateway. After the -put command, the gateway splits the file into blocks and stores them under the datanode directory /data/dfs/data/current/
(if you are running on a single node, that directory sits inside your Hadoop directory), as files of the form blk_<job_process_id>,
together with metadata files that carry the same id in their name and a .meta extension.
The block size in Hadoop 1 is 64 MB, while in Hadoop 2 it was increased to 128 MB; beyond that, the file is split into blocks by size as described above, and there is no tool in Hadoop HDFS to control this (if there is one, please let me know!).
In Hadoop 1 we simply put a file onto the cluster as below; if the file size is 100 MB, then:
bin/hadoop fs -put <full path of the input file, including extension> </user/datanode/(target-dir)>
the Hadoop 1 gateway splits the file into two blocks (64 MB and 36 MB), while the Hadoop 2 gateway needs only one block, and the blocks are then replicated according to your configuration.
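If you want to see how HDFS actually split a particular file into blocks, the fsck report will show it (the path below is only an example):
bin/hdfs fsck /user/datanode/myfile.txt -files -blocks
(On Hadoop 1 the equivalent is bin/hadoop fsck.)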
If you are submitting a jar with hadoop for a MapReduce job, you can set the number of reduce tasks to 1 on org.apache.hadoop.mapreduce.Job in your Java mapper/reducer driver class, then export a jar for the MR job and test it, as below.
// Set the result to a single target file, inside the driver's main method
job.setNumReduceTasks(1);
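For context, here is a minimal driver sketch showing where that call sits (the class name, job name, and paths are placeholders, not from the question; the commented-out lines are where your own mapper and reducer classes go):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single output file");
        job.setJarByClass(SingleOutputDriver.class);
        // job.setMapperClass(YourMapper.class);    // plug in your own mapper class here
        // job.setReducerClass(YourReducer.class);  // and your own reducer class here
        job.setNumReduceTasks(1);                   // one reduce task => one part-r-00000 output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}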
Then run the hadoop jar command like this:
bin/hadoop jar <full path of your jar file> <fully qualified name of the main class inside the jar> <input directory or file path> <output target directory>
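For example (the jar, class, and directory names here are made up):
bin/hadoop jar /home/ali/maxtemp.jar com.example.MaxTempDriver /user/ali/input /user/ali/output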
If you are importing data from an RDBMS engine with Sqoop, you can use "-m 1" to get the result into a single file, but that is a different matter from your question.
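A typical Sqoop invocation would look like this (the connection string, table, user, and target directory are made-up examples):
sqoop import --connect jdbc:mysql://dbhost/mydb --table cities --username dbuser -P -m 1 --target-dir /user/ali/cities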
I hope my answer gives you some insight into the question. Thanks.