Question

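I am new to the Apache PDFBox library.

I want to map font information to the paragraphs of a PDF.

I have gone through the question How to extract font styles of text contents using pdfbox?

But it does not tell me which paragraph is written in which font.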

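For example, if my page contains the text:

para1: written in Arial

para2: written in Times New Roman

then I should be able to find out that para1 is written in Arial and para2 is written in Times New Roman.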

The solution proposed in that question only tells me that the PDF page contains Arial and Times New Roman.

Answer 1

The PDFTextStripper class you are using is documented (see its JavaDoc comment) as follows:

* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.

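So to get specific font information, you have to change it a bit.

The font information is available inside this class all along; it is only discarded when a line is output. Have a look at its source: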

protected void writePage() throws IOException
{
    [...]
    for( int i = 0; i < charactersByArticle.size(); i++)
    {
        [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
        while( textIter.hasNext() )
        {
            [...]
            if( lastPosition != null )
            {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
............

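The TextPosition instances in the list line still carry all the available formatting information, including the font used. Only when line is "normalized" is it reduced to plain characters.

So to preserve the font information you have different options, depending on how you want to retrieve it:

If you want to keep retrieving all page content information (including fonts) in a single string via getText: change the method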
```
private List<String> normalize(List<TextPosition> line, boolean isRtlDominant, boolean hasRtl)
```
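so that it includes a font marker of your choice (e.g. [Arial]) whenever the font changes. Unfortunately this method is private, so you have to copy the whole PDFTextStripper class and change the code of the copy.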
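The marker idea can be illustrated without PDFBox. The following is only a minimal sketch: Glyph is a hypothetical stand-in for TextPosition that holds a piece of text and a font name, and withFontMarkers is an illustrative helper, not part of the library. It emits a marker such as [Arial] each time the font differs from that of the previous glyph.

```java
import java.util.List;

// Hypothetical stand-in for PDFBox's TextPosition (assumption for illustration):
// one piece of text plus the name of the font it is drawn in.
final class Glyph {
    final String text;
    final String fontName;

    Glyph(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

final class FontMarkerSketch {
    // Concatenates the glyph texts and writes "[fontName]" whenever the font
    // differs from that of the previous glyph (including before the first one).
    static String withFontMarkers(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        String currentFont = null;
        for (Glyph g : line) {
            if (!g.fontName.equals(currentFont)) {
                out.append('[').append(g.fontName).append(']');
                currentFont = g.fontName;
            }
            out.append(g.text);
        }
        return out.toString();
    }
}
```

In a copied stripper class, the same comparison would presumably be made on the font of each TextPosition while the line is being normalized.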

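If you want to retrieve the specific information in a different structure (e.g. a List<List<TextPosition>>), you can derive your own stripper class from PDFTextStripper, add a variable of the desired type, and override the protected method writePage mentioned above: copy it and merely enhance it, before or after the line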

writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);

with code that adds to your new variable. E.g.

public class MyPDFTextStripper extends PDFTextStripper
{
    public List<List<TextPosition>> myLines = new ArrayList<List<TextPosition>>();
    [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    myLines.add(new ArrayList<TextPosition>(line));
                    line.clear();
                    [...]
                }

Now you can call getText on an instance of MyPDFTextStripper, retrieve the plain text as the result, and access the additional data via the new variable.
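Once the lines are collected, one way to use them is to pick the most frequent font per line. The sketch below assumes the font name of every TextPosition in myLines has already been copied into a plain string (how to read that name depends on the PDFBox version and is not shown here). LineFontSketch and its methods are hypothetical helper names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LineFontSketch {
    // Returns the font name that occurs most often in one line.
    // Ties go to the font that appears first; an empty line yields null.
    static String dominantFont(List<String> fontNamesOfLine) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : fontNamesOfLine) {
            counts.merge(name, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Applies dominantFont to every line, keeping the line order.
    static List<String> dominantFonts(List<List<String>> lines) {
        List<String> result = new ArrayList<>();
        for (List<String> line : lines) {
            result.add(dominantFont(line));
        }
        return result;
    }
}
```

Consecutive lines that share the same dominant font could then be treated as one paragraph, though that grouping rule is only a heuristic.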

Answer 2

To use fonts other than the ones that come with the library, you need to add the font files explicitly.
