Question

I have a small problem. Basically, I want to extract string data from a pdf file. More specifically, this pdf file:

http://www.midttrafik.dk/koereplaner/bybusser/aarhus/bybusser-aarhus/18-mejlbyelev-park-all%C3%A9-skaade-moesgaard/koereplan

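My problem is that I don't know how to get the names and the times. The pdf is a bus timetable: the street names are in the left column, and the remaining columns hold the times at which the bus arrives. The information I want to save is the number (1-4) listed with each street, the street name, and all the times.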

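To translate some of what is in the pdf: "Faste minuttal" just means that the bus times listed under "Faste" are the same minutes past each hour. "6.56 - 8.11" means that those minutes apply within this interval. So the bus will stop at 'Elev Skole, Høvej' at 56, 11, 26, 41, meaning 6.56, 7.11, 7.26, 7.41, 7.56, 8.11.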
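To make the "Faste minuttal" rule concrete, here is a minimal sketch of how the fixed minutes could be expanded into actual departure times once they have been read from the pdf. The class and method names (FixedMinuteExpander, expand) are made up for illustration, and the sketch assumes the interval does not cross midnight.

```java
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FixedMinuteExpander {

    // Expands "fixed minutes past the hour" into concrete times that fall
    // inside the inclusive interval [start, end].
    // Assumption: start and end are on the same day (no wrap past midnight).
    public static List<LocalTime> expand(LocalTime start, LocalTime end, int[] minutes) {
        List<LocalTime> times = new ArrayList<>();
        for (int hour = start.getHour(); hour <= end.getHour(); hour++) {
            for (int minute : minutes) {
                LocalTime t = LocalTime.of(hour, minute);
                // keep only the times inside the interval
                if (!t.isBefore(start) && !t.isAfter(end)) {
                    times.add(t);
                }
            }
        }
        // the minutes may be listed out of order (56, 11, 26, 41), so sort the result
        Collections.sort(times);
        return times;
    }

    public static void main(String[] args) {
        // the example from the question: interval 6.56 - 8.11, minutes 56, 11, 26, 41
        List<LocalTime> times = expand(
                LocalTime.of(6, 56), LocalTime.of(8, 11), new int[] {56, 11, 26, 41});
        System.out.println(times);
        // prints [06:56, 07:11, 07:26, 07:41, 07:56, 08:11]
    }
}
```

This only covers the arithmetic; reading the interval and the minutes out of the pdf text is a separate step.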

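I don't think I can explain my problem any better, so I hope one of you can help. I don't need ready-made code, just point me in the right direction: tell me what I could do, give me some guidance, or suggest a good pattern to use. Thanks.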

Answer 1

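You can use the excellent PDFBox library to extract the text you need from this pdf file. It works very well; I used it in one of my recent projects to index pdf files for full-text search.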

Here is the project's URL: http://pdfbox.apache.org/index.html

There you will also find the documentation and some examples of how to extract text from a pdf.

Sample code:

import java.io.*;
import org.apache.pdfbox.pdmodel.*;
// PDFBox 1.x package; in PDFBox 2.x, PDFTextStripper lives in org.apache.pdfbox.text
import org.apache.pdfbox.util.*;

public class LittleExample {

    public static void main(String[] args) {
        PDDocument pd = null;
        BufferedWriter wr = null;
        try {
            // this is your pdf from which you would like to extract the text
            File input = new File("/home/ottp/pdffiles/1.pdf");
            // this is the target file to store the extracted text
            File output = new File("/home/ottp/pdffiles/extracts/1.txt");
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());

            // optional: save a copy of the loaded document
            pd.save("CopyOfInvoice.pdf");

            PDFTextStripper stripper = new PDFTextStripper();
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            // extract the text of all pages and write it to the output file
            stripper.writeText(pd, wr);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // close (and flush) the writer, then the document, even if extraction failed
            try {
                if (wr != null) wr.close();
                if (pd != null) pd.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
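
Once the text is in a plain file, a regular expression is one possible pattern for splitting each line into a stop name and its times. The sketch below is only an illustration: it assumes each extracted line looks like "stop name followed by times written as H.MM", which may not match what PDFBox actually produces for this timetable, and the class name TimetableLineParser is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimetableLineParser {

    // One or two digits, a dot, two digits, e.g. "6.56" or "12.05".
    private static final Pattern TIME = Pattern.compile("\\b\\d{1,2}\\.\\d{2}\\b");

    // Everything before the first time on the line is treated as the stop name.
    // If the line contains no time, the whole line is returned.
    public static String stopName(String line) {
        Matcher m = TIME.matcher(line);
        return (m.find() ? line.substring(0, m.start()) : line).trim();
    }

    // Collects every time-like token on the line, in order of appearance.
    public static List<String> times(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = TIME.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical extracted line; \u00f8 is the letter "ø" in "Høvej".
        String line = "Elev Skole, H\u00f8vej 6.56 7.11 7.26";
        System.out.println(stopName(line));
        System.out.println(times(line));
    }
}
```

It is worth looking at the real extracted text first, since columns in a pdf table do not always come out in reading order, and then adjusting the pattern to fit.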
