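I have a MapReduce job that takes as input a sequence file I built earlier. The sequence file has the image file names as keys and the images' bytes as values. My mapper is supposed to take each image and process it with Tesseract, an image-to-text (OCR) library, through its Java wrapper Tess4J. The mapper runs and does not throw any exceptions, but surprisingly the output folder is empty and no files are generated. Here is my mapper code: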

    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;

    import javax.imageio.ImageIO;

    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class testM extends Mapper<Text, BytesWritable, Text, Text> {

        @Override
        public void map(Text ikey, BytesWritable ivalue, Context context)
                throws IOException, InterruptedException {
            // Decode the current image from the record's bytes. getBytes() returns the
            // backing array, which can be longer than the data, so stop at getLength().
            BufferedImage img = ImageIO.read(
                    new ByteArrayInputStream(ivalue.getBytes(), 0, ivalue.getLength()));
            Tesseract instance = Tesseract.getInstance();
            try {
                String text = instance.doOCR(img);
                // Placeholder output, only to check that the mapper emits anything.
                context.write(new Text("fff"), new Text("fff"));
            } catch (TesseractException e) {
                // Emit the same placeholder on failure and log the OCR error.
                context.write(new Text("fff"), new Text("fff"));
                e.printStackTrace();
            }
        }
    }

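A side note on the decoding step in the mapper: `BytesWritable.getBytes()` returns the backing array, which may be longer than the valid data, so only the first `getLength()` bytes belong to the image. The sketch below illustrates that idea using only the standard library, with a plain `byte[]` standing in for the `BytesWritable`. The class name `ImageBytes` and the methods `decode` and `encodePng` are hypothetical names chosen for this illustration, not part of Hadoop or Tess4J, and this is probably not the cause of the empty output.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import javax.imageio.ImageIO;

// Hypothetical helper class, for illustration only.
public class ImageBytes {

    // Decode only the first `length` bytes of `buffer`; anything after that is
    // treated as padding. ImageIO.read returns null if no reader recognises the data.
    public static BufferedImage decode(byte[] buffer, int length) throws IOException {
        return ImageIO.read(new ByteArrayInputStream(buffer, 0, length));
    }

    // Encode an image as PNG bytes so the example has something to decode.
    public static byte[] encodePng(BufferedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(img, "png", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage original = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        byte[] png = encodePng(original);

        // Simulate a backing array with spare capacity after the valid data.
        byte[] padded = Arrays.copyOf(png, png.length + 16);

        BufferedImage decoded = decode(padded, png.length);
        System.out.println(decoded.getWidth() + "x" + decoded.getHeight());
    }
}
```

In the mapper, the equivalent call would pass `ivalue.getBytes()` and `ivalue.getLength()` as the two arguments.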
Here is the driver code:
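
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class driver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Image2Text");
            job.setJarByClass(driver.class);
            job.setMapperClass(testM.class);
            // The base Reducer class passes every key/value pair through unchanged.
            job.setReducerClass(Reducer.class);
            // Final output types are left at their defaults; only the map output types are set.
            //job.setOutputKeyClass(Text.class);
            //job.setOutputValueClass(Text.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // Read the sequence file of (file name, image bytes) records.
            job.setInputFormatClass(SequenceFileInputFormat.class);
            // Input and output are directories, not files.
            FileInputFormat.setInputPaths(job, new Path("inSeq"));
            FileOutputFormat.setOutputPath(job, new Path("out"));
            // Exit with a non-zero status if the job fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
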
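I tried emitting "fff" just to make sure the mapper is working, but as I said, it does not output anything. If I remove the line `String text = instance.doOCR(img);`, everything works fine. I checked the contents of my sequence file and looked at the value of `img`, and both look fine. Does anyone know what the problem is?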
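One possibility that the code above does not rule out is that `doOCR` fails with something other than a `TesseractException`, for example an `Error` or a `RuntimeException`, which the mapper's `catch` block would not handle. The stdlib-only sketch below shows the difference between the narrow catch and a diagnostic `catch (Throwable t)`. Everything in it is a stand-in: `CatchDemo`, `OcrException`, and `fakeOcr` are hypothetical names, and the simulated `UnsatisfiedLinkError` is only an assumption about what such a failure could look like, not a confirmed diagnosis.

```java
// Hypothetical demo class, for illustration only; no Hadoop or Tess4J involved.
public class CatchDemo {

    // Stand-in for TesseractException (a checked exception).
    static class OcrException extends Exception {
        OcrException(String message) {
            super(message);
        }
    }

    // Stand-in for instance.doOCR(img): fails with an Error, not an OcrException.
    static String fakeOcr() throws OcrException {
        throw new UnsatisfiedLinkError("simulated: native library not found");
    }

    // Mirrors the mapper's try/catch: only OcrException is handled,
    // so the Error propagates to the caller.
    static String narrowCatch() {
        try {
            return fakeOcr();
        } catch (OcrException e) {
            return "caught: " + e.getMessage();
        }
    }

    // Diagnostic variant: catch every Throwable so the failure can be reported.
    static String broadCatch() {
        try {
            return fakeOcr();
        } catch (Throwable t) {
            return "caught " + t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(broadCatch());
        try {
            narrowCatch();
        } catch (Throwable t) {
            System.out.println("escaped narrowCatch: " + t.getClass().getName());
        }
    }
}
```

Applied to the mapper, the same idea would mean temporarily widening the `catch` to `Throwable` and writing the throwable's class name and message to the output or the task log, purely as a debugging step.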