Question

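I am developing an API in Java that takes the location of a PDF file as a parameter and returns the PDF's text content.

I started using Apache Tika to extract the text. Since the PDFs can be very long, I would like to know the best, fastest, and most accurate way to get the text. Because it is a RESTful API, I return the text as JSON.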

Should I compress the text? Convert it to bytes? What is the best format to return it in?
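To make the "compress the text" option concrete, here is a minimal sketch that gzips a string with the JDK's `java.util.zip` classes and restores it again. The class name `GzipText` and its methods are hypothetical helpers made up for illustration, not part of Tika or of any REST framework, and the sample text is just repeated filler. It assumes Java 11 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
final class GzipText {

    // Encode the text as UTF-8 and gzip the resulting bytes.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer into the buffer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Reverse of compress(): gunzip the bytes and decode them as UTF-8.
    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Filler text standing in for extracted PDF content.
        String text = "Lorem ipsum dolor sit amet. ".repeat(5000);
        byte[] packed = compress(text);
        System.out.println("raw bytes:  " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("gzip bytes: " + packed.length);
        System.out.println("round trip ok: " + decompress(packed).equals(text));
    }
}
```

Two caveats with this approach: raw gzip bytes cannot be placed directly in a JSON string, so they would have to be Base64-encoded, which grows them by about a third. HTTP also has a standard `Content-Encoding: gzip` response header, so compressing the whole JSON response at the transport level may be simpler than compressing the text field by hand.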

My current extraction code:

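        import java.io.InputStream;
        import java.net.URL;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        // -1 disables the default 100,000-character write limit, which long PDFs would exceed
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // AutoDetectParser detects the file type and delegates to the PDF parser
        AutoDetectParser parser = new AutoDetectParser();

        // `test` holds the PDF location as a URL string; try-with-resources closes the stream
        try (InputStream inputstream = new URL(test).openStream()) {
            parser.parse(inputstream, handler, metadata, pcontext);
        }

        // getting the content of the document
        System.out.println("Contents of the PDF: " + handler.toString());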

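How can I extract and send a large amount of text? Or should I cap the maximum amount sent? The PDFs contain hundreds of pages of text, so what is the best option?

Thanks.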
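To illustrate the "cap the maximum amount sent" idea, here is a minimal sketch that splits already-extracted text into fixed-size chunks, so that a client could request one chunk per call instead of the whole document. The class `TextPager`, its `split` method, and the chunk size are hypothetical names and values chosen for illustration; nothing here comes from Tika. It assumes Java 11 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class TextPager {

    // Split text into consecutive chunks of at most maxChars UTF-16 chars.
    // Note: a cut can fall between the two halves of a surrogate pair;
    // a production version would need to adjust the boundary.
    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += maxChars) {
            int end = Math.min(text.length(), start + maxChars);
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Filler text standing in for extracted PDF content.
        String text = "Lorem ipsum dolor sit amet. ".repeat(1000);
        List<String> chunks = split(text, 4096);
        System.out.println("total chars: " + text.length());
        System.out.println("chunks:      " + chunks.size());
        System.out.println("first chunk: " + chunks.get(0).length() + " chars");
    }
}
```

Each response could then carry one chunk together with its index and the total chunk count, which keeps every JSON payload bounded regardless of how long the PDF is.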

Best way to send a large amount of text in a RESTful service

0 answers: