Question

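I am trying to use the iText PDFSweep RegexBasedCleanupStrategy to redact certain words from a PDF, but I only want to redact the word when it stands alone, not when it appears inside other words. For example, I want to redact "al" as a single word, but I do not want to redact the "al" in "mineral". So I added the word boundary ("\b") to the regex passed as the parameter to RegexBasedCleanupStrategy: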

  new RegexBasedCleanupStrategy("\\bal\\b")

However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.

Answer 1

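In short

The cause of this issue is that the routine which flattens the extracted text chunks into a single String, so that the regular expression can be applied, does not insert any indicator for a line break. In that String, the last letter of one line is therefore immediately followed by the first letter of the next line, which hides the word boundary. The behavior can be fixed by appending an appropriate character to the String whenever a line break occurs.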
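The effect can be reproduced with plain java.util.regex, without iText. The following is only a minimal sketch: the class WordBoundaryDemo, its helper methods and the sample lines are made up for illustration and are not part of iText or PDFSweep.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: joins text lines with the given separator.
    // An empty separator mimics the flattening behavior described above.
    static String flatten(List<String> lines, String separator) {
        return String.join(separator, lines);
    }

    // Applies the same pattern as in the question.
    static boolean containsWord(String text) {
        return Pattern.compile("\\bal\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        // "al" is the last word of the first line.
        List<String> lines = Arrays.asList("vamos al", "mercado");

        // Flattened to "vamos almercado": no boundary after "al", no match.
        System.out.println(containsWord(flatten(lines, "")));   // false

        // Flattened to "vamos al\nmercado": the newline restores the boundary.
        System.out.println(containsWord(flatten(lines, "\n"))); // true
    }
}
```

Without a separator, the "al" at the end of the line is seen as the start of "almercado", so `\b` cannot match after it.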

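The problematic code

The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In the case of a merely horizontal gap this routine inserts a space character, but in the case of a vertical offset, i.e. a line break, it appends nothing extra to the StringBuilder in which the String representation is built: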

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

The code above can be extended to insert a newline character in the case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

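This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression. Enabling it to recognize word boundaries properly should therefore not break anything, and should in fact be regarded as a fix.

One might consider using a different character for a line break, e.g. a plain space ' ' if vertical gaps should not be treated differently from horizontal ones. For a general fix, one might therefore consider making this character a settable property of the strategy.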
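To illustrate the idea of a settable separator, here is a rough, self-contained sketch. It is not iText code: ChunkFlattener, Chunk and the lineSeparator property are hypothetical names, the line index and gap flag stand in for the positional checks sameLine and isAtWordBoundary, and the index map maintained by the real routine is left out.

```java
import java.util.Arrays;
import java.util.List;

class ChunkFlattener {
    // Hypothetical stand-in for a positioned text chunk; not an iText type.
    static final class Chunk {
        final String text;
        final int line;          // index of the text line the chunk sits on
        final boolean gapBefore; // true if a horizontal gap precedes the chunk

        Chunk(String text, int line, boolean gapBefore) {
            this.text = text;
            this.line = line;
            this.gapBefore = gapBefore;
        }
    }

    // Assumed settable property: what to append when the line changes.
    private final String lineSeparator;

    ChunkFlattener(String lineSeparator) {
        this.lineSeparator = lineSeparator;
    }

    String flatten(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null) {
                if (chunk.line == last.line) {
                    // Same line: add a space only for a real gap and only
                    // if neither neighbor already supplies one.
                    if (chunk.gapBefore && !chunk.text.startsWith(" ")
                            && !last.text.endsWith(" ")) {
                        sb.append(' ');
                    }
                } else {
                    // Line break: append the configured separator.
                    sb.append(lineSeparator);
                }
            }
            sb.append(chunk.text);
            last = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk("vamos", 0, false),
                new Chunk("al", 0, true),
                new Chunk("mercado", 1, false));
        System.out.println(new ChunkFlattener(" ").flatten(chunks));
    }
}
```

With "\n" as the separator, line breaks stay distinguishable from horizontal gaps. With " " both are treated alike. In either case `\bal\b` can match at the end of a line.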

Versions

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.
