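I am running an application on google-app-engine.
I am trying to extract the text from a PDF file stored in google-cloud-storage.
The code succeeds when I run it locally, but on App Engine it fails with org.pdfbox.exceptions.WrappedIOException.
Here is my code: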
import com.google.cloud.storage.*;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class Download {
    public static String perform(String bucket, String file) throws IOException {
        byte[] fileByte = download(bucket, file);
        String pdfFileTxt = pdf2txt(fileByte);
        return pdfFileTxt;
    }

    // Fetches the whole object from Cloud Storage into memory.
    public static byte[] download(String bucketName, String fileId) throws IOException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobId blobId = BlobId.of(bucketName, fileId);
        Blob blob = storage.get(blobId);
        return blob.getContent();
    }

    // Parses the PDF bytes and extracts the text.
    public static String pdf2txt(byte[] byteArr) throws IOException {
        InputStream stream = new ByteArrayInputStream(byteArr);
        PDFParser parser = new PDFParser(stream);
        parser.parse(); // this is the call that fails on App Engine
        PDDocument pdDoc = new PDDocument(parser.getDocument());
        return new PDFTextStripper().getText(pdDoc);
    }
}
The code fails at parser.parse(); with org.pdfbox.exceptions.WrappedIOException, and no further message is attached :(
The download from storage actually succeeds. If I log the data, I get something like this:
%PDF-1.3
%����
7 0 obj
<</Linearized 1/L 7945/O 9/E 3524/N 1/T 7656/H [ 451 137]>>
endobj
13 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<4DC91A1875A6D707AEC203BB021C93A0><F6C92B368A8A13408457A1D395A37EB9>]/Index[7 21]/Info 6 0 R/Length 52/Prev 7657/Root 8 0 R/Size 28/Type/XRef/W[1 2 1]>>stream
h�bbd``b`� ��H0� 6G ��#�4�,#��Ɲ_ L��
endstream
endobj
startxref
0
%%EOF
... more ...
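A quick way to confirm in code that the downloaded bytes really are a PDF, rather than reading the log by eye, is to check for the "%PDF-" marker shown at the top of the dump above. The following is only a minimal sketch: the class PdfHeaderCheck and the method looksLikePdf are hypothetical names, not part of any library, and the check assumes the marker sits at offset 0, as it does in this file.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox or the Cloud Storage client. */
final class PdfHeaderCheck {
    private PdfHeaderCheck() {}

    /** Returns true if the data starts with the ASCII marker "%PDF-". */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        // Compare the first bytes of the data against the marker.
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling PdfHeaderCheck.looksLikePdf(fileByte) right after download(...) would separate a bad download from a parser failure.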
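Is there any way to overcome this? Perhaps by using a different library? Because the code runs on App Engine, tracking down errors like this is very hard.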
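Answer 0 (score: 0)
PdfBox does not run on GAE. It uses Java classes that GAE does not allow.
As a workaround, you can download a modified pdfbox jar. Credit goes to icyerasor's answer and to these instructions.
Here are the complete instructions:
Download this folder. Then go to your project directory and run the following commands: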
mkdir local-maven-repo
mvn deploy:deploy-file -DgroupId=org.apache.pdfbox -DartifactId=pdfbox -Dversion=1.8.0-SNAPSHOT -Durl=file:./local-maven-repo/ -DrepositoryId=local-maven-repo -DupdateReleaseInfo=true -Dfile=/your/path/to/download/directory/pdfbox-GAE/pdfbox-1.8.0-SNAPSHOT.jar
mvn deploy:deploy-file -DgroupId=org.apache.pdfbox -DartifactId=fontbox -Dversion=1.8.0-SNAPSHOT -Durl=file:./local-maven-repo/ -DrepositoryId=local-maven-repo -DupdateReleaseInfo=true -Dfile=/your/path/to/download/directory/pdfbox-GAE/dependencies/fontbox-1.8.0-SNAPSHOT.jar
Add the following to your project's pom:
<repositories>
    <repository>
        <id>local-maven-repo</id>
        <url>file:///${project.basedir}/local-maven-repo</url>
    </repository>
</repositories>
Now edit the dependencies in your pom:
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging</artifactId>
    <version>1.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-logging/commons-logging-api -->
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging-api</artifactId>
    <version>1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-logging/commons-logging-adapters -->
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging-adapters</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.0-SNAPSHOT</version>
</dependency>
And finally, the working code:
import com.google.cloud.storage.*;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class Download {
    public static String perform(String bucket, String file) throws IOException {
        byte[] fileByte = download(bucket, file);
        String pdfFileTxt = pdf2txt2(fileByte);
        return pdfFileTxt;
    }

    // Fetches the whole object from Cloud Storage into memory.
    public static byte[] download(String bucketName, String fileId) throws IOException {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobId blobId = BlobId.of(bucketName, fileId);
        Blob blob = storage.get(blobId);
        return blob.getContent();
    }

    // Loads the PDF from the byte array and extracts its text.
    public static String pdf2txt2(byte[] byteArr) throws IOException {
        InputStream myInputStream = new ByteArrayInputStream(byteArr);
        PDDocument pddDoc = PDDocument.load(myInputStream);
        try {
            PDFTextStripper reader = new PDFTextStripper();
            return reader.getText(pddDoc);
        } finally {
            // Close the document even if text extraction throws.
            pddDoc.close();
        }
    }
}
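Since the original exception arrived with no message, it may also help to log the whole cause chain of whatever is thrown, so the underlying failure becomes visible in the App Engine logs. The following is only a rough sketch using plain java.lang.Throwable: the class CauseChain and the method describe are hypothetical names, not part of PDFBox, and the depth limit of 20 is an arbitrary guard.

```java
/** Hypothetical helper, not part of PDFBox: flattens an exception's cause chain. */
final class CauseChain {
    private CauseChain() {}

    /** Returns "Class: message" for each throwable in the chain, outermost first. */
    static String describe(Throwable t) {
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        // Cap the depth to guard against cyclic cause chains.
        for (Throwable cur = t; cur != null && depth < 20; cur = cur.getCause(), depth++) {
            if (depth > 0) {
                sb.append(" <- ");
            }
            sb.append(cur.getClass().getName()).append(": ").append(cur.getMessage());
        }
        return sb.toString();
    }
}
```

One way to use it is to wrap the call to pdf2txt2(...) in a try/catch for IOException, log CauseChain.describe(e), and rethrow.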