Converting multiple PDF files to txt in Java

Time: 2018-06-05 05:41:37

Tags: java apache-tika

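I am using pdfbox to convert a PDF to txt, but I have multiple files in one folder, and each of them needs to be converted into its own txt file. My source code is: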

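import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PDFconversion
{
    public static void main(final String[] args) throws IOException, SAXException, TikaException
    {
        // Assume the source file is in your current directory
        File file = new File("sourcefile");

        // Parse method parameters
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        metadata.set("org.apache.tika.parser.pdf.sortbyposition", "true");
        ParseContext pcontext = new ParseContext();
        PDFParser pdfparser = new PDFParser();

        System.out.println("Parsing PDF to TEXT...");

        // Parse the file; try-with-resources closes the input stream
        try (FileInputStream inputstream = new FileInputStream(file))
        {
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Write the extracted text; closing the writer flushes it to disk
        try (FileWriter fw = new FileWriter("targetfile"))
        {
            fw.write(handler.toString().trim());
        }

        //System.out.println("Contents of the document:" + handler.toString());
    }
}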

1 Answer:

Answer 0 (score: 1)

How about 'java -jar tika-app.jar -t -i #input_dir# -o #output_dir#'? That invokes batch mode, which converts a full directory into a mirror directory of .txt files... or of .json files with the '-J' option.
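If you would rather stay in Java than call tika-app, the same "one .txt per PDF" idea can be sketched as a plain directory loop. This is only a minimal sketch under some assumptions: the class name BatchPdfToText, the TextExtractor interface, and the helper txtNameFor are hypothetical names invented for illustration (they are not part of Tika or pdfbox), only files matching *.pdf directly inside the input folder are picked up, and the output is written as UTF-8. The actual PDF parsing is passed in as a function, so the loop itself needs nothing beyond the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class BatchPdfToText {

    /** Hypothetical hook (not a Tika type): turns one PDF file into plain text. */
    @FunctionalInterface
    interface TextExtractor {
        String extract(Path pdf) throws Exception;
    }

    /** Maps "report.pdf" to "report.txt"; other names just get ".txt" appended. */
    static String txtNameFor(String pdfName) {
        String base = pdfName.toLowerCase().endsWith(".pdf")
                ? pdfName.substring(0, pdfName.length() - 4)
                : pdfName;
        return base + ".txt";
    }

    /**
     * Converts every *.pdf directly inside inputDir (no recursion) and writes
     * one .txt file per PDF into outputDir. Returns the files that were written.
     */
    static List<Path> convertAll(Path inputDir, Path outputDir, TextExtractor extractor)
            throws Exception {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        // The glob is matched against file names; on case-sensitive file systems
        // "*.pdf" will not match "X.PDF" (an assumption of this sketch).
        try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(inputDir, "*.pdf")) {
            for (Path pdf : pdfs) {
                Path target = outputDir.resolve(txtNameFor(pdf.getFileName().toString()));
                String text = extractor.extract(pdf).trim();
                Files.write(target, text.getBytes(StandardCharsets.UTF_8));
                written.add(target);
            }
        }
        return written;
    }
}
```

To plug in the parsing code from the question, the extractor passed to convertAll would open the given path as an input stream, run pdfparser.parse(...) with a fresh BodyContentHandler for each file, and return handler.toString(). A fresh handler per file matters, since reusing one would append each PDF's text to the previous one.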