Tesseract - ERROR net.sourceforge.tess4j.Tesseract - null

Time: 2016-09-15 06:20:08

Tags: java tomcat ocr tesseract tess4j

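I created a Java application that uses Tesseract to convert a given image or PDF to a string. It works fine when I run it on my machine as a JUnit unit test. However, when I run the full system, a RESTful API served by Tomcat that receives an image and runs Tesseract on it, I get the following error: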

  

23:22:36.511 [http-nio-9999-exec-3] ERROR net.sourceforge.tess4j.Tesseract - null
java.lang.NullPointerException: null
    at net.sourceforge.tess4j.util.PdfUtilities.convertPdf2Png(PdfUtilities.java:107)
    at net.sourceforge.tess4j.util.PdfUtilities.convertPdf2Tiff(PdfUtilities.java:48)
    at net.sourceforge.tess4j.util.ImageIOHelper.getIIOImageList(ImageIOHelper.java:343)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:213)
    at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:197)
    at ocr.OcrUtil.getString(OcrUtil.java:54)
    at com.tapd.server.api.handlers.IRSHandler.uploadIRSImage(IRSHandler.java:65)
    at com.tapd.server.api.WebAPIService.updateParentIrsForm(WebAPIService.java:250)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:309)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:292)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1139)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:460)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:386)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:334)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:221)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:230)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:192)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:165)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:108)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:522)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:79)
    at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:620)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:349)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:1110)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:785)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1425)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Unknown Source)
[2016-09-14 23:22:36,512] [ERROR] java.lang.NullPointerException

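My guess is that the tessdata folder is not in the right place: once it is packaged into the JAR and run by Tomcat, it ends up misplaced. However, I cannot determine where it should be located, and I have double-checked that all the JARs are deployed correctly.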

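Edit: it appears that Tesseract cannot handle paths on a remote server (such as AWS S3). So the question is: why, and how can I make it accept an S3 path? (Yes, the file is public.)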

3 answers:

Answer 0 (score: 5)

My guess is that a GhostscriptException is not being logged correctly, and that this is what causes the NullPointerException:

https://github.com/nguyenq/tess4j/blob/212d72bc2ec8b3a4d4f5a18f1eb01a0622fc5521/src/main/java/net/sourceforge/tess4j/util/PdfUtilities.java#L107

106        } catch (GhostscriptException e) {
107            logger.error(e.getCause().toString(), e);
108        } finally {

On line 107, e.getCause() is (probably) null, and calling toString() on null throws an NPE.

(Per the specification, getCause() may return null: https://docs.oracle.com/javase/7/docs/api/java/lang/Throwable.html#getCause(), and GhostscriptException also allows a null cause: http://grepcode.com/file/repo1.maven.org/maven2/org.ghost4j/ghost4j/1.0.0/org/ghost4j/GhostscriptException.java)

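To verify this answer without recompiling all of tess4j, you can start the program in debug mode and set a breakpoint on line 107. That will give you information about the real exception.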
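The null-cause failure described in this answer can be reproduced in isolation. The following is a minimal sketch, not the tess4j code itself: `ConversionException` is a hypothetical stand-in for `GhostscriptException`, and `unsafeDescribe` and `safeDescribe` are illustrative helper names. It shows how calling `toString()` on a missing cause hides the real error behind an NPE, and one possible null-safe alternative.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class NullCauseLogging {
    private static final Logger LOGGER = Logger.getLogger(NullCauseLogging.class.getName());

    /** Hypothetical stand-in for org.ghost4j.GhostscriptException. */
    static class ConversionException extends Exception {
        ConversionException(String message) {
            super(message);
        }

        ConversionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Mirrors the failing pattern: throws an NPE when the exception has no cause. */
    static String unsafeDescribe(Exception e) {
        return e.getCause().toString();
    }

    /** Falls back to the exception itself when no cause is attached. */
    static String safeDescribe(Exception e) {
        Throwable cause = e.getCause();
        return (cause != null ? cause : e).toString();
    }

    public static void main(String[] args) {
        // An exception created without a cause, so getCause() returns null.
        Exception noCause = new ConversionException("Cannot open input file");
        try {
            unsafeDescribe(noCause);
        } catch (NullPointerException npe) {
            System.out.println("Unsafe logging replaced the real error with an NPE");
        }
        // The null-safe variant keeps the original message visible in the log.
        LOGGER.log(Level.SEVERE, safeDescribe(noCause), noCause);
    }
}
```

With the null-safe variant, the log entry would carry the original exception's message instead of the bare "null" seen in the question's log line.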

Answer 1 (score: 1)

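As @Piotr R mentioned, the error is that ghostscriptException.getCause() is null. The underlying reason is that the path configured in the File object passed to Tesseract is not a valid path. Tesseract's notion of "valid" differs somewhat from yours: it treats only local paths as valid, so pointing it at a file on AWS S3 raises the error even if the file is public. The solution is to save the file locally and delete it after Tesseract has finished.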
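The save-locally-then-delete approach could look roughly like the sketch below. The class name `LocalCopyOcr`, the method `withLocalCopy`, and the `ocrStep` callback are hypothetical names introduced for illustration; the callback stands in for the actual OCR call so that the example runs without Tess4J on the classpath.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Function;

public class LocalCopyOcr {

    /**
     * Downloads the resource at remoteUrl to a temporary local file, passes that
     * file to ocrStep, and deletes the file afterwards, even if ocrStep fails.
     */
    static String withLocalCopy(URL remoteUrl, String suffix, Function<File, String> ocrStep)
            throws IOException {
        Path tmp = Files.createTempFile("ocr-input-", suffix);
        try {
            try (InputStream in = remoteUrl.openStream()) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return ocrStep.apply(tmp.toFile());
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo without network access: a local file exposed as a file: URL plays
        // the role of the remote (for example S3) object.
        Path source = Files.createTempFile("demo-", ".txt");
        Files.write(source, "hello".getBytes(StandardCharsets.UTF_8));
        try {
            String result = withLocalCopy(source.toUri().toURL(), ".txt",
                    file -> file.getName() + " (" + file.length() + " bytes)");
            System.out.println(result);
        } finally {
            Files.deleteIfExists(source);
        }
    }
}
```

In the real application, the callback would wrap something like `tesseract.doOCR(file)` and handle its checked `TesseractException`. It is probably worth passing the original extension (such as ".pdf") as the suffix, on the assumption that Tess4J chooses its PDF conversion path from the file name.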

Answer 2 (score: 0)

Resources I used: Windows 10 (also tried on Windows Server 2016), Java, Maven

Status: working well on my local machine as well as on a VM

1. Download Tess4J-3.4.8 from http://tess4j.sourceforge.net/ and set your environment variable path under Advanced System Settings.
2. Get the dependencies from Maven:

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.5.1</version>
</dependency>
<dependency>
    <groupId>org.ghost4j</groupId>
    <artifactId>ghost4j</artifactId>
    <version>1.0.1</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.lept4j</groupId>
    <artifactId>lept4j</artifactId>
    <version>1.7.0</version>
</dependency>

3. Download libtesseract302.dll from http://api.256file.com/libtesseract302.dll/en-download-56466.html and copy it to the "C:\Windows\System32" folder. Do not forget to set your environment variable path under Advanced System Settings.

4. Download and install the Visual C++ 2015 Redistributable or the Visual C++ 2017 Redistributable (I installed both); see https://programmer.help/blogs/net.sourceforge.tess4j.tesseractexception-java.lang.nullpointerexception.html. Then restart your PC.

5. To be on the safe side, keep the required JAR files locally if you do not already have them (they were shown in an image attached to the original answer). Do not forget to set your environment variable path for the JARs under Advanced System Settings.