Converting a PDF file to text on HDFS (Java)

Date: 2017-06-09 06:34:19

Tags: eclipse hadoop pdfbox

Here I have written a PdfInputFormat class extending FileInputFormat. This class returns an object of the PdfRecordReader class, which does all of the PDF conversion. I am facing an error here.
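(The PdfInputFormat class itself is not shown in the question. For context, a minimal sketch of what such a class would typically look like follows; the package and class names come from the question's stack trace, the rest is an assumption:)

package com.amal.pdf;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PdfInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Hand each split to the custom reader that parses the PDF.
        return new PdfRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // A PDF cannot be parsed from an arbitrary byte offset, so give
        // the whole file to a single mapper.
        return false;
    }
}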

I created the jar in Eclipse, going to:

Export > JAR file.

I selected the required packages for the jar.

I am executing the jar with the following command:

hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver /user/tcs/wordcountfile.pdf /user/convert

After running it, I get the following exception:

17/06/09 09:26:51 WARN mapred.LocalJobRunner: job_local1466878685_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549)
Caused by: java.lang.NoClassDefFoundError: org/apache/fontbox/cmap/CMapParser
at org.apache.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:548)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:383)
at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:372)
at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:61)
at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:552)
at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:248)
at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
at com.amal.pdf.PdfRecordReader.initialize(PdfRecordReader.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:548)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:786)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.fontbox.cmap.CMapParser
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 21 more
17/06/09 09:26:52 INFO mapreduce.Job: Job job_local1466878685_0001 failed with state FAILED due to: NA
17/06/09 09:26:52 INFO mapreduce.Job: Counters: 0
false

Here is the code:

PdfRecordReader class (code):
package com.amal.pdf;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfRecordReader extends RecordReader<LongWritable, Text> {

    private String[] lines = null;
    private LongWritable key = null;
    private Text value = null;
    // Kept as a field so the document can be released in close().
    private PDDocument pdf = null;

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        final Path file = split.getPath();
        /*
         * The code below opens the file for the split and applies the
         * PDF parsing logic: the whole document is stripped to text up
         * front and split into lines.
         */
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        pdf = PDDocument.load(fileIn);
        PDFTextStripper stripper = new PDFTextStripper();
        // getting the exception because of this line ****
        String parsedText = stripper.getText(pdf);
        this.lines = parsedText.split("\n");
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (key == null) {
            // First call: emit lines[0] under key 1.
            key = new LongWritable(1);
            value = new Text(lines[0]);
            return true;
        }
        // The key is 1-based and always holds the index of the next line
        // to emit from the 0-based array.
        int next = (int) key.get();
        if (next >= lines.length) {
            return false;
        }
        value = new Text(lines[next]);
        key = new LongWritable(next + 1);
        return true;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException,
            InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (lines == null || lines.length == 0 || key == null) {
            return 0.0f;
        }
        return Math.min(1.0f, (float) key.get() / (float) lines.length);
    }

    @Override
    public void close() throws IOException {
        // Release the resources held by the parsed document.
        if (pdf != null) {
            pdf.close();
        }
    }
}

Note: since this is for the Hadoop environment, Eclipse does not generate a runnable JAR for this project. Is there any way to export this project as a runnable JAR?

I need help understanding what I am doing wrong.

1 Answer:

Answer 0 (score: 0)

The error occurs because Hadoop cannot find the class org.apache.fontbox.cmap.CMapParser, which comes from an external library that your code imports.

The external dependency jars were not packaged together with the jar you passed to the hadoop command, so the Hadoop system cannot find them. When the hadoop command runs, your code (jars) is distributed across the HDFS cluster, and the dependent jars are not shipped along with it, so they cannot be located at runtime.

There are two solutions you can follow:

1) Include the external jars via the hadoop command's -libjars option:

hadoop jar /home/tcs/converter.jar com.amal.pdf.PdfInputDriver -libjars <path to external jars comma separated> /user/tcs/wordcountfile.pdf /user/convert
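Note that generic options such as -libjars are only honored when the driver parses them through ToolRunner/GenericOptionsParser. PdfInputDriver is not shown in the question, so the following is only a hypothetical sketch of a driver wired up that way (the job configuration details are assumptions):

package com.amal.pdf;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PdfInputDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner has already stripped generic options (e.g. -libjars),
        // so args[0] and args[1] are the input and output paths.
        Job job = Job.getInstance(getConf(), "pdf to text");
        job.setJarByClass(PdfInputDriver.class);
        job.setInputFormatClass(PdfInputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new PdfInputDriver(), args));
    }
}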

2) Or you can use the shade plugin to create an uber jar that bundles all the dependent libraries inside your own jar, as sketched below.
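For example, with Maven, a configuration along these lines builds a single jar containing PDFBox, FontBox, and the other dependencies (the plugin version and main class are assumptions based on the question):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Set the driver as the jar's main class. -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.amal.pdf.PdfInputDriver</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

After mvn package, the resulting uber jar can be run with the same hadoop jar command, without any -libjars option.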