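My goal is to convert a PDF to XHTML so that the embedded images are linked correctly in the HTML.
The code below works with Tika 1.12 but not with 1.14. The problem appears to be that 1.12 uses PDFBox 1.8, whereas 1.14 uses PDFBox 2.0.
With 1.14 I get font and TIFF errors such as: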
WARN org.apache.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for g74 (103) in font HFKECA+TimesNewRoman
ERROR org.apache.pdfbox.tools.imageio.ImageIOUtil - No ImageWriter found for 'tif' format
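From various forums and the Tika source, it seems I need to include jai_imageio.jar on my classpath for TIFF support. That did not stop the errors. I also tried adding the following jars, and swapping jai_imageio.jar for jai-imageio-core-1.3.1.jar (the GitHub version):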
jempbox-1.8.13.jar
fontbox-2.0.4.jar
levigo-jbig2-imageio-1.6.5.jar
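Again, none of these jars seemed to make any difference. Using the Tika 1.12 jars, however, gives perfect results.
What do I need to do so that Tika 1.14 runs without these warnings, as Tika 1.12 does?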
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.tika.example;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.io.FilenameUtils;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExtractEmbeddedFiles {

    private final Parser parser = new AutoDetectParser();
    private final Detector detector = ((AutoDetectParser) parser).getDetector();
    private final TikaConfig config = TikaConfig.getDefaultConfig();

    public void extract(String inputPath) throws SAXException, TikaException, IOException {
        File inputFile = new File(inputPath);
        String parentDirectory = inputFile.getAbsoluteFile().getParentFile().getPath();
        File outputDirectory = new File(parentDirectory, inputFile.getName() + "-extracted");

        ToXMLContentHandler handler = new ToXMLContentHandler();

        // ask the PDF parser to hand inline images to the embedded document extractor
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(Parser.class, parser);
        EmbeddedDocumentExtractor ex =
                new MyEmbeddedDocumentExtractor(outputDirectory.toPath(), parseContext);
        parseContext.set(EmbeddedDocumentExtractor.class, ex);

        Metadata metadata = new Metadata();
        // close the input stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(inputFile)) {
            parser.parse(inputStream, handler, metadata, parseContext);
        }
        String text = handler.toString().trim();

        // the output directory does not exist yet if no embedded file was extracted
        Files.createDirectories(outputDirectory.toPath());
        Path outputFile = outputDirectory.toPath().resolve(inputFile.getName() + ".xhtml");
        try (Writer writer = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    private class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
        private final Path outputDir;
        private int fileCount = 0;

        private MyEmbeddedDocumentExtractor(Path outputDir, ParseContext context) {
            super(context);
            this.outputDir = outputDir;
        }

        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml) throws SAXException, IOException {
            // try to get the name of the embedded file from the metadata
            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
            if (name == null) {
                name = "file_" + fileCount++;
            } else {
                // keep only the file name (not any directory paths that might be
                // included in the name) and normalize it
                name = FilenameUtils.normalize(FilenameUtils.getName(name));
            }

            // detection reads ahead and then resets the stream, so it needs mark/reset
            // support; wrap the stream but do not close it, the caller owns it
            TikaInputStream tis = TikaInputStream.get(stream);

            // now try to figure out the right extension for the embedded file
            MediaType contentType = detector.detect(tis, metadata);
            if (!name.contains(".") && contentType != null) {
                try {
                    name += config.getMimeRepository()
                            .forName(contentType.toString()).getExtension();
                } catch (MimeTypeException e) {
                    e.printStackTrace();
                }
            }

            // should add a check to make sure that you aren't overwriting a file;
            // Files.copy throws FileAlreadyExistsException if the target exists
            Path outputFile = outputDir.resolve(name);
            // do a better job than this of checking
            Files.createDirectories(outputFile.getParent());
            Files.copy(tis, outputFile);
        }
    }
}
Answer 0 (score: 0)
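It's odd, but it now seems to work with the following combination of libraries:
The problem may have been that I originally used a different library for ImageIO.
There are a variety of jai libraries available from Java, and the one above is only one of them.
I also tried this GitHub library.
It may be that none of them work except the one above, but I'm reluctant to make that strong a claim.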
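Since the "No ImageWriter found for 'tif' format" error is about what ImageIO can see at runtime, one way to compare jar combinations is to ask ImageIO directly whether a writer is registered. The following is a minimal sketch that uses only the JDK's javax.imageio API; the class name ImageWriterCheck and its hasWriter method are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;

// Hypothetical diagnostic helper, not part of Tika or PDFBox.
public class ImageWriterCheck {

    // Returns true if ImageIO has at least one writer registered for the format name.
    public static boolean hasWriter(String formatName) {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        return writers.hasNext();
    }

    public static void main(String[] args) {
        // Re-scan the classpath so that plugin jars added to it are picked up.
        ImageIO.scanForPlugins();
        for (String format : new String[] {"tif", "tiff", "png"}) {
            System.out.println(format + " writer available: " + hasWriter(format));
        }
    }
}
```

Running this with the same classpath as the Tika program should show whether the jar that is meant to supply the TIFF plugin is actually being loaded. The result for "tif" depends on the JDK version and on which plugin jars are present, so no particular output is assumed here.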