Processing image files with Hadoop on AWS

Time: 2016-06-22 12:05:37

Tags: java hadoop amazon-s3 amazon-emr

So, I have this job to do: "From an image dataset located on S3, use Elastic MapReduce (EMR) to convert the images to grayscale and write them all to a PDF file."

Since then I have been looking for ways to do it, and I found that Hadoop has no default input format for image files. So I found two approaches:

Problem:

I tried both approaches, but both attempts failed, so I could not get the result I wanted.

For the first approach, I got this:

2016-06-22 11:54:34,894 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-2-62.us-west-2.compute.internal/172.31.2.62:8032
2016-06-22 11:54:39,387 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 170
2016-06-22 11:54:39,494 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2016-06-22 11:54:39,508 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 3454067872d644fcdf99dbacccc3e0acfcd41bc0]
2016-06-22 11:54:39,740 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:170
2016-06-22 11:54:40,366 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1466077468134_0004
2016-06-22 11:54:40,767 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1466077468134_0004
2016-06-22 11:54:40,970 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-2-62.us-west-2.compute.internal:20888/proxy/application_1466077468134_0004/
2016-06-22 11:54:40,971 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1466077468134_0004
2016-06-22 11:55:00,456 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1466077468134_0004 running in uber mode : false
2016-06-22 11:55:00,458 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2016-06-22 11:55:17,726 INFO org.apache.hadoop.mapreduce.Job (main):  map 1% reduce 0%
2016-06-22 11:55:17,745 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000000_0, Status : FAILED
2016-06-22 11:55:18,783 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2016-06-22 11:55:31,906 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000001_0, Status : FAILED
2016-06-22 11:55:31,908 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000002_0, Status : FAILED
2016-06-22 11:55:33,932 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000003_0, Status : FAILED
2016-06-22 11:55:49,079 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000002_1, Status : FAILED
2016-06-22 11:55:59,166 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000000_1, Status : FAILED
2016-06-22 11:56:01,184 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000001_1, Status : FAILED
2016-06-22 11:56:04,209 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000003_1, Status : FAILED
2016-06-22 11:56:19,372 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000001_2, Status : FAILED
2016-06-22 11:56:27,438 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000002_2, Status : FAILED
2016-06-22 11:56:29,457 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000000_2, Status : FAILED
2016-06-22 11:56:34,497 INFO org.apache.hadoop.mapreduce.Job (main): Task Id : attempt_1466077468134_0004_m_000003_2, Status : FAILED
2016-06-22 11:56:49,613 INFO org.apache.hadoop.mapreduce.Job (main):  map 2% reduce 0%
2016-06-22 11:56:50,620 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 100%
2016-06-22 11:56:51,644 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1466077468134_0004 failed with state FAILED due to: Task failed task_1466077468134_0004_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

2016-06-22 11:56:51,909 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 17
    Job Counters 
        Failed map tasks=13
        Killed map tasks=169
        Killed reduce tasks=3
        Launched map tasks=15
        Other local map tasks=11
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=7254456
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=302269
        Total time spent by all reduce tasks (ms)=0
        Total vcore-milliseconds taken by all map tasks=302269
        Total vcore-milliseconds taken by all reduce tasks=0
        Total megabyte-milliseconds taken by all map tasks=232142592
        Total megabyte-milliseconds taken by all reduce tasks=0
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0

For the second approach, I just got this (as if it had not found the path):

Input image directory: s3://tarefahadoop/input
Input FS: local FS
Output HIB: s3://tarefahadoop/images.hib
Overwrite HIB if it exists: false

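For me, the best solution would be to convert the files into a SequenceFile, because I have already finished part 2 of the work: open the SequenceFile, convert each record to an image, convert the image to grayscale (map) and write the images to a PDF (reduce). But I would appreciate any other useful ideas.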
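For reference, the per-image step of part 2 (encoded image bytes to a grayscale image) might look roughly like the sketch below. It uses only `javax.imageio` and `java.awt.image` from the JDK. The class name `GrayscaleSketch` and its method names are hypothetical, the luminance weights (0.299, 0.587, 0.114) are an assumed choice, and the Hadoop wiring (mapper, reading the bytes from the SequenceFile value) and the PDF writing are deliberately left out.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: names are illustrative, not taken from the original job.
public class GrayscaleSketch {

    // Decodes encoded image bytes (PNG, JPEG, ...) into a BufferedImage.
    public static BufferedImage decode(byte[] encoded) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(encoded));
            if (image == null) {
                // ImageIO.read returns null when no registered reader understands the data.
                throw new IllegalArgumentException("Unsupported or corrupt image data");
            }
            return image;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encodes an image as PNG bytes, e.g. to store it back in a BytesWritable.
    public static byte[] encodePng(BufferedImage image) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ImageIO.write(image, "png", out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Converts an image to 8-bit grayscale using assumed luminance weights.
    public static BufferedImage toGrayscale(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage gray = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int lum = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                raster.setSample(x, y, 0, Math.min(255, lum));
            }
        }
        return gray;
    }

    // Full per-record step: encoded bytes in, grayscale PNG bytes out.
    public static byte[] grayscalePng(byte[] encoded) {
        return encodePng(toGrayscale(decode(encoded)));
    }
}
```

In a mapper, `grayscalePng` would presumably be called on the bytes of each SequenceFile value, with the result emitted for the reducer that assembles the PDF.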

0 answers:

No answers