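I am building an application that has to parse system-generated PDFs and use the parsed information to populate columns in my application's database. Unfortunately, the PDF structure I am dealing with has a column called comments that contains both text and images. I have found ways to read images and text from a PDF separately, but my ultimate goal is to put a placeholder such as {2} into the parsed content, so that whenever my parser (application code) processes that line, the system renders the appropriate image at that position. The images themselves are stored in a separate table in my application. Please help me solve this.
Thanks in advance.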
Answer 0 (score: 1)
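As already mentioned in the comments, a solution is essentially to use a custom text extraction strategy that inserts a "[2]" text chunk at the coordinates of the image.
You can, for example, extend LocationTextExtractionStrategy like this: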
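    import java.io.File;
    import java.io.IOException;
    import java.lang.reflect.Field;
    import java.nio.file.Files;
    import java.util.List;

    // iText 5.x parser classes
    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.PdfImageObject;
    import com.itextpdf.text.pdf.parser.Vector;

    class SimpleMixedExtractionStrategy extends LocationTextExtractionStrategy
    {
        SimpleMixedExtractionStrategy(File outputPath, String name)
        {
            this.outputPath = outputPath;
            this.name = name;
        }

        @Override
        @SuppressWarnings("unchecked")
        public void renderImage(final ImageRenderInfo renderInfo)
        {
            try
            {
                PdfImageObject image = renderInfo.getImage();
                if (image == null) return;
                // Store the image under a numbered file name.
                int number = counter++;
                final String filename = String.format("%s-%s.%s", name, number, image.getFileType());
                Files.write(new File(outputPath, filename).toPath(), image.getImageAsBytes());
                // Bottom edge of the image in page coordinates.
                LineSegment segment = UNIT_LINE.transformBy(renderInfo.getImageCTM());
                TextChunk location = new TextChunk("[" + filename + "]", segment.getStartPoint(), segment.getEndPoint(), 0f);
                // The chunk list of the base class is private, so add the marker via reflection.
                Field field = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
                field.setAccessible(true);
                List<TextChunk> locationalResult = (List<TextChunk>) field.get(this);
                locationalResult.add(location);
            }
            catch (IOException | NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e)
            {
                e.printStackTrace();
            }
        }

        final File outputPath;
        final String name;
        int counter = 0;

        // Bottom edge of the unit square, to which PDF maps every image.
        final static LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
    }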
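(Unfortunately, some members of LocationTextExtractionStrategy are private, which gets in the way of this kind of work. I therefore used some Java reflection. Alternatively, you can copy the whole class and change your copy accordingly.)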
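The reflection step can be seen in isolation in the following minimal sketch. It does not use iText at all: ChunkCollector and MarkerCollector are hypothetical stand-ins for the library class and its subclass, and the field name chunks is made up for the illustration. The point is that the private field has to be looked up on the superclass, and that the lookup can be done once instead of on every call.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a library class that keeps its state private.
class ChunkCollector {
    private final List<String> chunks = new ArrayList<>();

    String joined() {
        return String.join(" ", chunks);
    }
}

// Subclass that reaches the private list of its superclass via reflection.
class MarkerCollector extends ChunkCollector {
    // Looked up once; the field is declared on the superclass, not on this class.
    private static final Field CHUNKS_FIELD = lookUpChunksField();

    private static Field lookUpChunksField() {
        try {
            Field field = ChunkCollector.class.getDeclaredField("chunks");
            field.setAccessible(true);
            return field;
        } catch (NoSuchFieldException e) {
            // The field was renamed or removed, e.g. after a library upgrade.
            throw new IllegalStateException("private field 'chunks' not found", e);
        }
    }

    @SuppressWarnings("unchecked")
    void addMarker(String marker) {
        try {
            ((List<String>) CHUNKS_FIELD.get(this)).add(marker);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Failing loudly when the field is missing is a deliberate choice here: code that depends on a private field name can break silently after a library upgrade, which is also the argument for copying the class instead.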
With SimpleMixedExtractionStrategy in place, you can extract mixed content like this:
    @Test
    public void testSimpleMixedExtraction() throws IOException
    {
        InputStream resourceStream = getClass().getResourceAsStream("book-of-vaadin-page14.pdf");
        try
        {
            PdfReader reader = new PdfReader(resourceStream);
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(OUTPUT_PATH, "book-of-vaadin-page14");
            parser.processContent(1, listener);
            Files.write(new File(OUTPUT_PATH, "book-of-vaadin-page14.txt").toPath(), listener.getResultantText().getBytes());
        }
        finally
        {
            if (resourceStream != null)
                resourceStream.close();
        }
    }
For example, for my test file (which contains page 14 of the Book of Vaadin), you get this text:
Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window Preferences Install/Update
Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse
and the two images book-of-vaadin-page14-0.png and book-of-vaadin-page14-1.png in OUTPUT_PATH.
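On the application side, the markers in the extracted text then have to be turned back into images. The following is a minimal sketch of that step, not part of the original answer: MarkerResolver and the map of rendered values are hypothetical, and in a real application the lookup would presumably go against the table in which the images are stored.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerResolver {
    // Matches markers such as "[book-of-vaadin-page14-0.png]".
    private static final Pattern MARKER =
            Pattern.compile("\\[([^\\[\\]]+\\.[A-Za-z0-9]+)\\]");

    /**
     * Replaces every marker whose file name is a key of 'rendered' with the
     * mapped value. Bracketed text without a matching entry is left untouched.
     */
    static String resolve(String text, Map<String, String> rendered) {
        Matcher matcher = MARKER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = rendered.get(matcher.group(1));
            if (replacement == null) {
                replacement = matcher.group(); // unknown marker: keep as is
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Calling MarkerResolver.resolve(text, map) with a map from file names to, say, image tags or image ids yields the text with each known marker substituted. Ordinary bracketed text in the PDF that happens to look like a marker is only replaced if the map contains it.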
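As already mentioned in the comments, this solution works for the simple case in which there is text above and/or below an image but none to its left or right.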
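If there is also text to the left and/or right, the problem is that the code above computes LineSegment segment as the bottom edge of the image, whereas the text strategy usually works with the baseline of the text, which lies above that bottom edge.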
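In that case, however, you first have to decide on which line you want the marker to appear in the text anyway. Once that is decided, you can adapt the code above accordingly.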
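One possible adaptation is to anchor the marker at a different height of the image instead of its bottom edge. The sketch below shows only the underlying arithmetic in plain Java, without iText: MarkerLine is a hypothetical helper, and the matrix is assumed to be given in the usual PDF order {a, b, c, d, e, f}. In the strategy above, the corresponding change would presumably be to build the line segment from Vector(0, t, 1) to Vector(1, t, 1) rather than using UNIT_LINE.

```java
class MarkerLine {
    /**
     * Maps the horizontal line at height t of the unit square (t = 0 is the
     * bottom edge, t = 1 the top edge) through an image matrix given as
     * {a, b, c, d, e, f}. Returns {startX, startY, endX, endY} in page
     * coordinates.
     */
    static float[] at(float[] ctm, float t) {
        float a = ctm[0], b = ctm[1], c = ctm[2], d = ctm[3], e = ctm[4], f = ctm[5];
        // A point (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
        float startX = c * t + e;      // x = 0
        float startY = d * t + f;
        float endX = a + c * t + e;    // x = 1
        float endY = b + d * t + f;
        return new float[] { startX, startY, endX, endY };
    }
}
```

For an unrotated image that is 200 units wide and 100 units high with its lower left corner at (50, 300), t = 0 gives the line from (50, 300) to (250, 300), while t = 1 gives the top edge from (50, 400) to (250, 400). Which value of t lines up best with neighbouring text depends on the layout of the PDF in question.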