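When I try to extract text from the PDF above, I get a mix of text that is invisible in the evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, such as the "S" in "FALCONS" and many of the "½" characters. I believe this is caused by interference from the invisible text: when I highlight the PDF in a viewer, I can see that the hidden text overlaps the visible text.
Is there a way to remove the invisible text? Or is there another solution?
Code: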
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class App {
public static String getPdfText(String pdfPath) throws IOException {
File file = new File(pdfPath);
PDDocument document = null;
PDFTextStripper textStripper = null;
String text = null;
try {
document = PDDocument.load(file);
textStripper = new PDFTextStripper();
textStripper.setEndPage(1);
text = textStripper.getText(document);
} catch (IOException e) {
throw new IOException("Could not load file and strip text.", e);
} finally {
try {
if (document != null)
document.close();
} catch (IOException e) {
System.out.println("Could not close document");
}
}
return text;
}
public static void main(String[] args) {
String filename = "RevTeaser09072016.pdf";
String text = null;
try {
text = getPdfText(filename);
} catch (IOException e) {
e.printStackTrace();
System.exit(1);
}
System.out.println(text);
}
}
Output (the bold text is the desired text):
145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 Selections Number of Teams Amount Bet REVERSE tEaSER caRd mark box as shown denotes home team PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 3 REDSKINS ★ PM1:00 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½ 7 FALCONS ★ PM1:00 8 CARDINALS17– ½ 3+ ½ 9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½ 11 COWBOYS ★ PM1:00 12 BROWNS14– ½ + ½ 13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½ 15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ★ PM1:00 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½ PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,000 145 143 159 144 160 141 157155 156154150 153149 152148 151147 142 158 500 146 Selections Number of Teams Amount Bet REVERSE tEaSER caRd mark box as hown denotes home team PRO FOOTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012 3 REDSKINS ★ PM1:00 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:00 6 LIONS ★10– ½ 3– ½ 7 FALCONS ★ PM1:00 8 CARDINALS17– ½ 3+ ½ 9 BUCCANEERS PM1:00 10 PANTHERS ★7– ½ 6– ½ 11 COWBOYS ★ PM1:00 12 BROWNS14– ½ + ½ 13 RAMS ★ PM1:00 14 JETS10– ½ 3– ½ 15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ★ PM1:00 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:00 20 CHIEFS ★10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEEL RS ★7– ½ 6– ½ PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,000 145 143 159 14 160 41 15715 156154150 153149 152148 51147 142 158 50 146 S lections Number of Teams Amount Bet ark box as sho n denotes home team PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012 1 BILLS ★ NFL PM8:25 2 DOLPHINS7– ½ 6– ½ PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012 
3 REDSKINS ★ PM1:0 4 EAGLES10– ½ 3– ½ 5 PACKERS PM1:0 6 LIONS ★10– ½ 3– ½ 7 FALCONS ★ PM1:0 8 CARDINALS17– ½ 3+ ½ 9 BU CANEERS PM1:0 10 PANTHERS ★7– ½ 6– ½ 11 COWBOYS ★ PM1:0 12 BROWNS14– ½ + ½ 13 RAMS ★ PM1:0 14 JETS10– ½ 3– ½ 15 PATRIOTS ★ PM4:25 16 COLTS17– ½ 3+ ½ 17 TEXANS ★ PM1:0 18 JAGUARS23– ½ 9+ ½ 19 BENGALS PM1:0 20 CHIEFS ★10– ½ 3– ½ 21 SAINTS PM4:05 22 RAIDERS ★12– ½ 1– ½ 23 BRONCOS ★ PM4:25 24 CHARGERS14– ½ + ½ 25 RAVENS NBC PM8:30 26 STEELERS ★7– ½ 6– ½ PRO F OTBALL - MONDAY, NOVEMBER 19, 2012 27 49ERS ★ ESPN PM8:40 28 BEARS10– ½ 3– ½ 1,0 MARK BOX AS SHOWN DENOTES HOME TEAM PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016 1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS - 3½ PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016 FALCON - 9 1:00p 4 BUCCANEERS - 4½ 5 VIKINGS - 9½ 1:00p 6 TITANS - 4½ 7 EAGLES - 10½ 1:00p 8 BROWNS - 3½ 9 BENGALS - 9½ 1:00p 10 JETS - 4½ 11 SAINTS - 7½ 1:00p 12 RAIDERS - 6½ 13 CHIEFS - 14½ 1:00p 14 CHARGERS + ½ 15 RAVENS - 10½ 1:00p 16 BILLS - 3½ 17 TEXANS - 14 1:00p 18 BEARS + ½ 19 PACKERS - 12 1:00p 20 JAGUARS - 1½ 21 SEAHAWKS - 17½ 4:05p 22 DOLPHINS + 3½ 23 COWBOYS - 7½ 4:25p 24 GIANTS - 6½ 25 COLTS - 10½ 4:25p 26 LIONS - 3½ 27 CARDINALS nbc - 14½ 8:30p 28 PATRIOTS + ½ PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016 29 STEELERS espn - 10½ 7:10p 30 REDSKINS - 3½ 31 RAMS espn - 9 10:20p 32 49ERS - 4½
Answer 0 (score: 5)
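The invisible text in the OP's sample PDF is mostly made invisible in two ways: by defining clip paths that lie outside the bounds of the text, and by filling paths that cover the text underneath. To ignore the invisible text, we therefore have to take path-related instructions into account during text extraction.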
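The two hiding mechanisms can be illustrated with plain java.awt.geom shapes, independent of PDFBox. The following is only a minimal sketch: the class name HiddenTextGeometry, the method names, and all coordinates are made up for illustration, and a glyph is reduced to the two ends of its baseline.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;

/** Plain-geometry illustration of the two hiding mechanisms (hypothetical helper, no PDFBox). */
final class HiddenTextGeometry {

    /** A glyph survives clipping only if both ends of its baseline lie inside the clip area. */
    static boolean insideClip(Area clip, double startX, double endX, double y) {
        return clip == null || (clip.contains(startX, y) && clip.contains(endX, y));
    }

    /** A glyph drawn earlier counts as covered if a later fill contains either end of its baseline. */
    static boolean coveredByFill(GeneralPath fill, double startX, double endX, double y) {
        return fill.contains(startX, y) || fill.contains(endX, y);
    }

    public static void main(String[] args) {
        // Assumed clip region: a 100 x 50 rectangle at the origin.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 50));

        // Assumed filled rectangle painted after the text, spanning x 10..40 and y 10..30.
        GeneralPath fill = new GeneralPath(GeneralPath.WIND_NON_ZERO);
        fill.append(new Rectangle2D.Double(10, 10, 30, 20), false);

        // Glyph placed outside the clip region: dropped by the clip test.
        System.out.println("outside clip kept?   " + insideClip(clip, 120, 128, 20));
        // Glyph inside the clip region: kept by the clip test.
        System.out.println("inside clip kept?    " + insideClip(clip, 60, 68, 20));
        // Glyph lying under the filled rectangle: reported as covered.
        System.out.println("under fill covered?  " + coveredByFill(fill, 15, 23, 20));
        // Glyph away from the filled rectangle: not covered.
        System.out.println("beside fill covered? " + coveredByFill(fill, 60, 68, 20));
    }
}
```

Note the asymmetry: the clip test requires both baseline ends to be inside, while the fill test treats a glyph as hidden as soon as either end is covered.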
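Unfortunately, the callbacks designed for these instructions are not declared in PDFTextStripper or in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.
They are declared in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.
To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example like this: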
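// Imports assume the PDFBox 2.0.x package layout
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.MissingOperandException;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;
public class PDFVisibleTextStripper extends PDFTextStripper {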
public PDFVisibleTextStripper() throws IOException {
addOperator(new AppendRectangleToPath());
addOperator(new ClipEvenOddRule());
addOperator(new ClipNonZeroRule());
addOperator(new ClosePath());
addOperator(new CurveTo());
addOperator(new CurveToReplicateFinalPoint());
addOperator(new CurveToReplicateInitialPoint());
addOperator(new EndPath());
addOperator(new FillEvenOddAndStrokePath());
addOperator(new FillEvenOddRule());
addOperator(new FillNonZeroAndStrokePath());
addOperator(new FillNonZeroRule());
addOperator(new LineTo());
addOperator(new MoveTo());
addOperator(new StrokePath());
}
@Override
protected void processTextPosition(TextPosition text) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
PDGraphicsState gs = getGraphicsState();
Area area = gs.getCurrentClippingPath();
if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
super.processTextPosition(text);
}
private GeneralPath linePath = new GeneralPath();
void deleteCharsInPath() {
for (List<TextPosition> list : charactersByArticle) {
List<TextPosition> toRemove = new ArrayList<>();
for (TextPosition text : list) {
Matrix textMatrix = text.getTextMatrix();
Vector start = textMatrix.transform(new Vector(0, 0));
Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
toRemove.add(text);
}
}
if (toRemove.size() != 0) {
System.out.println(toRemove.size());
list.removeAll(toRemove);
}
}
}
public final class AppendRectangleToPath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x = (COSNumber) operands.get(0);
COSNumber y = (COSNumber) operands.get(1);
COSNumber w = (COSNumber) operands.get(2);
COSNumber h = (COSNumber) operands.get(3);
float x1 = x.floatValue();
float y1 = y.floatValue();
// create a pair of coordinates for the transformation
float x2 = w.floatValue() + x1;
float y2 = h.floatValue() + y1;
Point2D p0 = context.transformedPoint(x1, y1);
Point2D p1 = context.transformedPoint(x2, y1);
Point2D p2 = context.transformedPoint(x2, y2);
Point2D p3 = context.transformedPoint(x1, y2);
// to ensure that the path is created in the right direction, we have to create
// it by combining single lines instead of creating a simple rectangle
linePath.moveTo((float) p0.getX(), (float) p0.getY());
linePath.lineTo((float) p1.getX(), (float) p1.getY());
linePath.lineTo((float) p2.getX(), (float) p2.getY());
linePath.lineTo((float) p3.getX(), (float) p3.getY());
// close the subpath instead of adding the last line so that a possible set line
// cap style isn't taken into account at the "beginning" of the rectangle
linePath.closePath();
}
@Override
public String getName() {
return "re";
}
}
public final class StrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.reset();
}
@Override
public String getName() {
return "S";
}
}
public final class FillEvenOddRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "f*";
}
}
public class FillNonZeroRule extends OperatorProcessor {
@Override
public final void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "f";
}
}
public final class FillEvenOddAndStrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "B*";
}
}
public class FillNonZeroAndStrokePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
deleteCharsInPath();
linePath.reset();
}
@Override
public String getName() {
return "B";
}
}
public final class ClipEvenOddRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
getGraphicsState().intersectClippingPath(linePath);
}
@Override
public String getName() {
return "W*";
}
}
public class ClipNonZeroRule extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
getGraphicsState().intersectClippingPath(linePath);
}
@Override
public String getName() {
return "W";
}
}
public final class MoveTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 2) {
throw new MissingOperandException(operator, operands);
}
COSBase base0 = operands.get(0);
if (!(base0 instanceof COSNumber)) {
return;
}
COSBase base1 = operands.get(1);
if (!(base1 instanceof COSNumber)) {
return;
}
COSNumber x = (COSNumber) base0;
COSNumber y = (COSNumber) base1;
Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
linePath.moveTo(pos.x, pos.y);
}
@Override
public String getName() {
return "m";
}
}
public class LineTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 2) {
throw new MissingOperandException(operator, operands);
}
COSBase base0 = operands.get(0);
if (!(base0 instanceof COSNumber)) {
return;
}
COSBase base1 = operands.get(1);
if (!(base1 instanceof COSNumber)) {
return;
}
// append straight line segment from the current point to the point
COSNumber x = (COSNumber) base0;
COSNumber y = (COSNumber) base1;
Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
linePath.lineTo(pos.x, pos.y);
}
@Override
public String getName() {
return "l";
}
}
public class CurveTo extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 6) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x1 = (COSNumber) operands.get(0);
COSNumber y1 = (COSNumber) operands.get(1);
COSNumber x2 = (COSNumber) operands.get(2);
COSNumber y2 = (COSNumber) operands.get(3);
COSNumber x3 = (COSNumber) operands.get(4);
COSNumber y3 = (COSNumber) operands.get(5);
Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
}
@Override
public String getName() {
return "c";
}
}
public final class CurveToReplicateFinalPoint extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x1 = (COSNumber) operands.get(0);
COSNumber y1 = (COSNumber) operands.get(1);
COSNumber x3 = (COSNumber) operands.get(2);
COSNumber y3 = (COSNumber) operands.get(3);
Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
}
@Override
public String getName() {
return "y";
}
}
public class CurveToReplicateInitialPoint extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
if (operands.size() < 4) {
throw new MissingOperandException(operator, operands);
}
if (!checkArrayTypesClass(operands, COSNumber.class)) {
return;
}
COSNumber x2 = (COSNumber) operands.get(0);
COSNumber y2 = (COSNumber) operands.get(1);
COSNumber x3 = (COSNumber) operands.get(2);
COSNumber y3 = (COSNumber) operands.get(3);
Point2D currentPoint = linePath.getCurrentPoint();
Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
}
@Override
public String getName() {
return "v";
}
}
public final class ClosePath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.closePath();
}
@Override
public String getName() {
return "h";
}
}
public final class EndPath extends OperatorProcessor {
@Override
public void process(Operator operator, List<COSBase> operands) throws IOException {
linePath.reset();
}
@Override
public String getName() {
return "n";
}
}
}
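Make sure that the PDFVisibleTextStripper constructor uses the inner operator classes defined above, not the classes of the same name used by PageDrawer (those in the org.apache.pdfbox.contentstream.operator.graphics package). To be sure, simply follow the link under the code.
This reduces the output to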
REVERSE tEaSER caRd
500
elections
er of Teams
t Bet
1,000
MARK BOX AS SHOWN
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
1 PANTHERS nbc - 10½ 8:30p 2 BRONCOS - 3½
PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
3 FALCONS - 9½ 1:00p 4 BUCCANEERS - 4½
5 VIKINGS - 9½ 1:00p 6 TITANS - 4½
7 EAGLES - 10½ 1:00p 8 BROWNS - 3½
9 BENGALS - 9½ 1:00p 10 JETS - 4½
11 SAINTS - 7½ 1:00p 12 RAIDERS - 6½
13 CHIEFS - 14½ 1:00p 14 CHARGERS + ½
15 RAVENS - 10½ 1:00p 16 BILLS - 3½
17 TEXANS - 14½ 1:00p 18 BEARS + ½
19 PACKERS - 12½ 1:00p 20 JAGUARS - 1½
21 SEAHAWKS - 17½ 4:05p 22 DOLPHINS + 3½
23 COWBOYS - 7½ 4:25p 24 GIANTS - 6½
25 COLTS - 10½ 4:25p 26 LIONS - 3½
27 CARDINALS nbc - 14½ 8:30p 28 PATRIOTS + ½
PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
29 STEELERS espn - 10½ 7:10p 30 REDSKINS - 3½
31 RAMS espn - 9½ 10:20p 32 49ERS - 4½
which discards most of the unwanted data.
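In the context of this question, it became clear that the way processTextPosition and deleteCharsInPath calculate the end of a character's baseline implicitly assumes horizontal text without page rotation. If one relaxes the criterion for "visibility", however, and considers a character visible whenever the start of its baseline is visible, then Vector end no longer needs to be calculated and the code also works for rotated pages.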
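The effect of the relaxed criterion can be seen with plain java.awt.geom shapes. The following is a minimal sketch, not the answer's actual PDFBox code: the class name RelaxedVisibility, its methods, and the coordinates are hypothetical, and a 90 degree quadrant rotation stands in for a rotated page.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

/** Compares the strict (start and end) test with the relaxed (start only) test. */
final class RelaxedVisibility {

    /** Strict test: computes the baseline end as start.x + width, i.e. assumes text runs along +x. */
    static boolean strictVisible(Area clip, Point2D start, double width) {
        return clip == null
                || (clip.contains(start) && clip.contains(start.getX() + width, start.getY()));
    }

    /** Relaxed test: a glyph counts as visible if the start of its baseline is inside the clip. */
    static boolean relaxedVisible(Area clip, Point2D start) {
        return clip == null || clip.contains(start);
    }

    public static void main(String[] args) {
        // Assumed unrotated layout: a 100 x 10 clip strip, glyph starting at (80, 5), 15 units wide.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 10));
        Point2D start = new Point2D.Double(80, 5);
        double width = 15;

        // Rotate clip and glyph position together by 90 degrees, as a rotated page would.
        AffineTransform rotation = AffineTransform.getQuadrantRotateInstance(1);
        Area rotatedClip = clip.createTransformedArea(rotation);
        Point2D rotatedStart = rotation.transform(start, null);

        System.out.println("unrotated, strict:  " + strictVisible(clip, start, width));
        System.out.println("rotated, strict:    " + strictVisible(rotatedClip, rotatedStart, width));
        System.out.println("rotated, relaxed:   " + relaxedVisible(rotatedClip, rotatedStart));
    }
}
```

After the rotation the baseline runs vertically, so adding the width to the x coordinate moves the computed end point out of the clip strip and the strict test rejects a glyph that is in fact visible. The relaxed test avoids this, at the cost of also accepting glyphs whose start is visible but whose remainder is clipped or covered.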