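Extracting embedded TIFFs in TIKA 1.12 vs. TIKA 1.14

Time: 2017-02-18 03:30:15

Tags: pdfbox tiff apache-tika

My goal is to convert a PDF to XHTML so that the embedded images can be linked correctly in the HTML.

The following code works with TIKA 1.12 but not with 1.14. The cause appears to be that 1.12 uses PDFBox 1.8, while 1.14 uses PDFBox 2.0.

In 1.14 I get font errors and TIFF errors, for example: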

WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for g74 (103) in font HFKECA+TimesNewRoman

ERROR org.apache.pdfbox.tools.imageio.ImageIOUtil - No ImageWriter found for 'tif' format

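From various forums and from reading the TIKA source, it seems I need to include jai_imageio.jar on my classpath. That did not stop the errors. I also tried adding the following jars, and swapping jai_imageio.jar for jai-imageio-core-1.3.1.jar (the GitHub version):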

jempbox-1.8.13.jar
fontbox-2.0.4.jar
levigo-jbig2-imageio-1.6.5.jar

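Again, none of these jars seemed to make any difference. With the TIKA 1.12 jar, however, the results are perfect.

What do I need to do so that TIKA 1.14 runs without these warnings and errors (as TIKA 1.12 does)?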

/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.tika.example;


import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.nio.charset.Charset;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.parser.pdf.PDFParserConfig;

public class ExtractEmbeddedFiles {

  private Parser parser = new AutoDetectParser();
  private Detector detector = ((AutoDetectParser)parser).getDetector();
  private TikaConfig config = TikaConfig.getDefaultConfig();

  public void extract(String inputPath) throws SAXException, TikaException, IOException {

    File inputFile = new File(inputPath);
    String parentDirectory = inputFile.getAbsoluteFile().getParentFile().getPath();

    InputStream inputStream = new FileInputStream(inputFile);
    File outputDirectory = new File(parentDirectory, inputFile.getName() + "-extracted" );


    Parser parser = new AutoDetectParser();
    ToXMLContentHandler handler = new org.apache.tika.sax.ToXMLContentHandler();

    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();

    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); 

    EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(outputDirectory.toPath(), parseContext);
    parseContext.set(EmbeddedDocumentExtractor.class, ex);

    Metadata metadata = new Metadata();

    try {
      parser.parse(inputStream, handler, metadata, parseContext);
    } finally {
      //always release the file handle, even if parsing fails
      inputStream.close();
    }

    String text = handler.toString().trim();

    File outputFile = new File(outputDirectory, inputFile.getName() + ".xhtml" );


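    //make sure the output directory exists even if no embedded file was extracted
    Files.createDirectories(outputDirectory.toPath());

    //write the XHTML as UTF-8; try-with-resources closes the writer on failure too
    try (PrintWriter printer = new PrintWriter(new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(outputFile), Charset.forName("UTF-8"))))) {
      printer.print(text);
    }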

  }

  private class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
    private final Path outputDir;
    private int fileCount = 0;

    private MyEmbeddedDocumentExtractor(Path outputDir, ParseContext context) {
      super(context);
      this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
      return true;
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
      throws SAXException, IOException {

      //try to get the name of the embedded file from the metadata
      String name = metadata.get(Metadata.RESOURCE_NAME_KEY);

      if (name == null) {
        name = "file_" + fileCount++;
      } else {
        //make sure to select only the file name (not any directory paths
        //that might be included in the name) and make sure
        //to normalize the name
        name = FilenameUtils.normalize(FilenameUtils.getName(name));
      }

      //now try to figure out the right extension for the embedded file
      MediaType contentType = detector.detect(stream, metadata);

      if (name.indexOf('.')==-1 && contentType!=null) {
        try {
          name += config.getMimeRepository().forName(
          contentType.toString()).getExtension();
        } catch (MimeTypeException e) {
          e.printStackTrace();
        }
      }
      //should add check to make sure that you aren't overwriting a file
      Path outputFile = outputDir.resolve(name);

      //do a better job than this of checking
      Files.createDirectories(outputFile.getParent());
      Files.copy(stream, outputFile);
    }
  }
}

1 Answer:

Answer 0: (score: 0)

This is strange, but it now seems to work with the following combination of libraries:

  • tika-app-1.14.jar
  • jai_imageio.jar

Possibly the problem was that I initially used a different library for ImageIO.
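One way to see whether a given jar actually changes anything is to ask ImageIO directly whether a writer for 'tif' is registered, since that is the lookup the PDFBox error message complains about. The following is a minimal sketch using only the standard javax.imageio API; the class name ImageWriterCheck and the helper hasWriter are made up for illustration and are not part of TIKA or PDFBox. Run it with the same classpath as the TIKA program.

```java
import javax.imageio.ImageIO;
import java.util.Arrays;

// Hypothetical diagnostic class (not part of TIKA or PDFBox).
public class ImageWriterCheck {

    // Returns true if ImageIO has at least one writer registered
    // for the given informal format name, e.g. "tif" or "png".
    static boolean hasWriter(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Ask ImageIO to pick up any plugins found on the classpath.
        ImageIO.scanForPlugins();

        System.out.println("tif writer available: " + hasWriter("tif"));
        System.out.println("registered writer formats: "
                + Arrays.toString(ImageIO.getWriterFormatNames()));
    }
}
```

If the first line prints false, no jar on the classpath is contributing a TIFF writer, and the 'No ImageWriter found' error is to be expected.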

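There is a variety of jai libraries available from Java, and the one listed above is only one of them.

I also tried this GitHub library.

It may be that none of them work except the one above, but I am reluctant to make so strong a claim.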
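To tell the different jai libraries apart, it can help to print which provider class supplies each writer and which jar that class was loaded from. The sketch below again uses only standard JDK calls; ImageWriterOrigin and describeWriters are hypothetical names chosen for this example. The location can be missing for writers that ship with the JDK itself, so the code falls back to a placeholder in that case.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.spi.ImageWriterSpi;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical diagnostic class (not part of TIKA or PDFBox).
public class ImageWriterOrigin {

    // Lists every writer registered for the format as "class from location".
    static List<String> describeWriters(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        while (writers.hasNext()) {
            ImageWriter writer = writers.next();
            ImageWriterSpi spi = writer.getOriginatingProvider();
            // Prefer the provider class; fall back to the writer class itself.
            Class<?> type = (spi != null) ? spi.getClass() : writer.getClass();
            CodeSource source = type.getProtectionDomain().getCodeSource();
            String location = (source != null && source.getLocation() != null)
                    ? source.getLocation().toString()
                    : "(JDK built-in or unknown)";
            result.add(type.getName() + " from " + location);
            writer.dispose();
        }
        return result;
    }

    public static void main(String[] args) {
        ImageIO.scanForPlugins();
        List<String> found = describeWriters("tif");
        if (found.isEmpty()) {
            System.out.println("no tif writer registered");
        }
        for (String line : found) {
            System.out.println(line);
        }
    }
}
```

Running this once per candidate jar shows whether that jar registers a TIFF writer at all, which is a narrower claim than whether TIKA then extracts the images correctly.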