Question

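This is a basic question about mapreduce output.

I'm trying to create a map function that takes an xml file and generates a pdf using apache fop. However, I'm a bit confused about how to output it, since I know the output comes out as (key, value) pairs.

I'm also not using streaming to do this.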

Answer 1

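The purpose of map-reduce is to process large amounts of data that usually don't fit in memory, so the input and output are usually stored on disk in some form (a.k.a. files). The input and output must be specified in key-value format: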

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
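
To make the data flow above concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code: the word-count logic and the names `map_fn`, `combine_fn`, `reduce_fn`, `group_by_key` and `run_job` are made up for illustration, and each inner list of records stands in for one mapper's input split.

```python
from collections import defaultdict


def map_fn(key, value):
    # <k1, v1> -> list of <k2, v2>: emit (word, 1) for every word in the line.
    return [(word, 1) for word in value.split()]


def combine_fn(key, values):
    # Local pre-aggregation on one mapper's output: <k2, [v2]> -> <k2, v2>.
    return (key, sum(values))


def reduce_fn(key, values):
    # Final aggregation across all mappers: <k2, [v2]> -> <k3, v3>.
    return (key, sum(values))


def group_by_key(pairs):
    # Collect all values that share a key (a stand-in for the shuffle step).
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped


def run_job(splits):
    # splits: one list of (k1, v1) records per simulated mapper.
    combined = []
    for split in splits:
        mapped = [pair for k1, v1 in split for pair in map_fn(k1, v1)]
        combined.extend(combine_fn(k, vs) for k, vs in group_by_key(mapped).items())
    return dict(reduce_fn(k, vs) for k, vs in group_by_key(combined).items())


result = run_job([[(0, "a b a")], [(0, "b c")]])
```

Every stage only ever sees and emits pairs, which is why the PDF bytes have to travel as a value attached to some key.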

I haven't tried this, but here is what I would do:

Write the mapper output in this form: the key is the file name as Text (keep the file names unique), and the value is the output of fop. Write it out using TextOutputFormat.

Suggestion:

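I'm assuming your use case is just reading the input xml (maybe doing some operation on its data) and writing the data to PDF files using fop. I don't think this is a hadoop use case in the first place, because whatever you want to do can be done with a batch script. How big are your xml files? How many xml files do you need to process?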

EDIT:

SequenceFileOutputFormat writes to a SequenceFile. A SequenceFile has its own header and other metadata alongside the stored text. It also stores data in key:value form.

SequenceFile Common Header

version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte of actual version no. (e.g. SEQ4 or SEQ6)
keyClassName - String
valueClassName - String
compression - A boolean which specifies if compression is turned on for keys/values in this file.
blockCompression - A boolean which specifies if block compression is turned on for keys/values in this file.
compressor class - The classname of the CompressionCodec which is used to compress/decompress keys and/or values in this SequenceFile (if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote end of the header.

Using a SequenceFile will break your application, because you'll end up with corrupted output PDF files. Try this and see for yourself.
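
A quick way to see the problem is to look at the first bytes of the output. The sketch below is plain Python with a hypothetical helper `describe_header`. It relies on the 'SEQ' magic plus version byte from the header description above, and on the fact (from the PDF format, not from this thread) that a PDF file begins with `%PDF-`.

```python
def describe_header(data: bytes) -> str:
    # A SequenceFile starts with the 3-byte magic 'SEQ' followed by 1 version byte.
    if len(data) >= 4 and data[:3] == b"SEQ":
        return f"SequenceFile (version {data[3]})"
    # A PDF document starts with the marker '%PDF-'.
    if data.startswith(b"%PDF-"):
        return "PDF"
    return "unknown"


def describe_file(path: str) -> str:
    # Only the first few bytes are needed to tell the two formats apart.
    with open(path, "rb") as handle:
        return describe_header(handle.read(8))
```

A file that reports as a SequenceFile is a container wrapped around your PDF bytes, so a PDF viewer will not open it directly.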

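You have a lot of input files... that's where hadoop sucks (read this). I still feel you can do what you need with a script that calls fop on each document one by one. If you have multiple nodes, run the same script on different subsets of the input documents. Trust me, this will run faster than hadoop, considering the overhead involved in creating maps and reduces (you don't need reduces... I know).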
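
A rough sketch of that script approach, in Python. Assumptions: the `fop` launcher is on the PATH, each input is transformed with a single XSLT stylesheet via `fop -xml ... -xsl ... -pdf ...` (if your inputs are already XSL-FO the arguments will differ), and the helper names `shard`, `fop_command` and `convert_all` are made up for this example.

```python
import subprocess
from pathlib import Path


def shard(files, node_index, node_count):
    # Give each node every node_count-th file, starting at its own index,
    # so the nodes work on disjoint subsets of the input documents.
    return sorted(files)[node_index::node_count]


def fop_command(xml_path, xsl_path, out_dir):
    # Build the fop invocation for one document; the PDF is named after the xml file.
    pdf_path = Path(out_dir) / (Path(xml_path).stem + ".pdf")
    return ["fop", "-xml", str(xml_path), "-xsl", str(xsl_path), "-pdf", str(pdf_path)]


def convert_all(files, xsl_path, out_dir, node_index=0, node_count=1):
    # Run fop once per document in this node's subset, stopping on the first failure.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for xml_path in shard(files, node_index, node_count):
        subprocess.run(fop_command(xml_path, xsl_path, out_dir), check=True)
```

On a three-node setup, node 0 would call `convert_all(files, "style.xsl", "out", node_index=0, node_count=3)`, node 1 would pass `node_index=1`, and so on.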

How to output a whole file from a map job?
