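How to integrate an image transformer into Alfresco

Date: 2016-02-19 14:33:24

Tags: ocr alfresco tesseract

I am using a tesseract transformer in Alfresco so that TIFF images become searchable. I have found many tutorials on this and tried them on my Alfresco installation, but none of them worked.

Here is the example I followed: tesseract integration

I am using Alfresco Enterprise v5.0.2.

It seems the transformer is not integrated: I uploaded a TIFF image, but searching for words it contains returns no results. How can I check whether the transformer has been applied?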

1 answer:

Answer 0: (score: 0)

Install Tesseract OCR: download tesseract from (https://code.google.com/p/tesseract-ocr/downloads/list), then double-click tesseract-ocr-setup-3.02.02.exe to install it.

After installation, Tesseract OCR is located at "C:\Program Files (x86)\Tesseract-OCR".
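Before touching Alfresco it may help to confirm that the executable really is where the batch script later expects it. A minimal Python sketch, assuming the default install directory quoted above; the function name `find_tesseract` is made up for this illustration:

```python
from pathlib import Path
from typing import Optional

# Install directory quoted in the answer; adjust if the installer used another location.
DEFAULT_INSTALL_DIR = r"C:\Program Files (x86)\Tesseract-OCR"


def find_tesseract(install_dir: str = DEFAULT_INSTALL_DIR) -> Optional[Path]:
    """Return the path to tesseract.exe inside install_dir, or None if it is missing."""
    candidate = Path(install_dir) / "tesseract.exe"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    exe = find_tesseract()
    print(exe if exe else "tesseract.exe not found; check the install path")
```

If this reports that the file is missing, the path hard-coded in the batch script below needs to be changed to match the actual install location.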

Changes to make in Alfresco

Files to add:
1) OCR.bat
2) ocrpng-transform-context.xml
3) ocrjpeg-transform-context.xml
4) ocrtiff-transform-context.xml
5) alfresco-tesseract-search.jar
6) ocrtransform.log

1) OCR.bat

REM to see what happens
echo from %1 to %2 >>C:\tmp\ocrtransform.log


copy /Y %1 C:\TMP\%~n1%~x1

REM  call tesseract and redirect output to $TARGET
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" C:\TMP\%~n1%~x1 %~d2%~p2%~n2 -l eng
del C:\TMP\%~n1%~x1

Place this batch script in your Alfresco directory, "C:\Alfresco".

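This batch script passes the uploaded file to Tesseract OCR, which performs the actual OCR, and appends a line to ocrtransform.log for each call. Tesseract writes the recognized text to the target file, which Alfresco then reads. The default language in the script above is eng; you can change it, and you can specify several languages.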
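To make the argument handling easier to follow, here is a rough Python sketch of the command line the batch script ends up running. It is an illustration, not part of the setup: the function name `build_tesseract_command` is made up, the sketch skips the copy to C:\TMP, and the `deu` language in the usage line assumes that language data is installed. Tesseract appends `.txt` to the output name itself, which is why the script passes the target path without its extension (`%~d2%~p2%~n2`). Several languages are joined with `+` after `-l`.

```python
import ntpath
from typing import List, Sequence

# Executable path used in the batch script above.
TESSERACT_EXE = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


def build_tesseract_command(
    source: str, target: str, languages: Sequence[str] = ("eng",)
) -> List[str]:
    """Build the tesseract call for one source image and one target text file."""
    # Tesseract adds ".txt" on its own, so drop the extension from the target,
    # which is what %~d2%~p2%~n2 does in the batch script.
    output_base, _extension = ntpath.splitext(target)
    # Multiple languages are passed as one value joined with "+", e.g. "eng+deu".
    return [TESSERACT_EXE, source, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(build_tesseract_command(r"C:\TMP\scan.tif", r"C:\TMP\scan_out.txt", ["eng", "deu"]))
```

In the batch script itself the equivalent change is to replace `-l eng` with, for example, `-l eng+deu`.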

Add the following transformation XML files to "C:\Alfresco\tomcat\shared\classes\alfresco\extension".

2) ocrpng-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <!-- Bean ids are specific to PNG so they do not clash with the JPEG and TIFF context files -->
  <bean id="transformer.worker.ocr.png" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <!-- checkCommand: confirms that ocr.bat is present -->
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <!-- transformCommand: runs ocr.bat with the source image and the target text file -->
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/png</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.png" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.png" />
    </property>
  </bean>
</beans>

3) ocrjpeg-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <!-- Bean ids are specific to JPEG; reusing the TIFF ids here would let one definition override the other -->
  <bean id="transformer.worker.ocr.jpeg" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <!-- checkCommand: confirms that ocr.bat is present -->
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <!-- transformCommand: runs ocr.bat with the source image and the target text file -->
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>
        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/jpeg</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>
      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.jpeg" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.jpeg" />
    </property>
  </bean>
</beans>

4) ocrtiff-transform-context.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans default-lazy-init="false" default-autowire="no" default-dependency-check="none">
  <bean id="transformer.worker.ocr.tiff" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" lazy-init="default" autowire="default" dependency-check="default">
    <property name="mimetypeService">
      <ref bean="mimetypeService" />
    </property>
    <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>dir c:\Alfresco\ocr.bat</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1</value>
        </property>
      </bean>
    </property>
    <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec" lazy-init="default" autowire="default" dependency-check="default">
        <property name="commandsAndArguments">
          <map>
            <entry key="Windows.*">
              <list>
                <value>C:\Windows\System32\cmd.exe</value>
                <value>/C</value>
                <value>C:\Alfresco\ocr.bat</value>
                <value>"${source}"</value>
                <value>"${target}"</value>
              </list>
            </entry>
          </map>
        </property>
        <property name="errorCodes">
          <value>1,2</value>
        </property>
      </bean>
    </property>
    <property name="explicitTransformations">
      <list>

        <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" lazy-init="default" autowire="default" dependency-check="default">
          <property name="sourceMimetype">
            <value>image/tiff</value>
          </property>
          <property name="targetMimetype">
            <value>text/plain</value>
          </property>
        </bean>


      </list>
    </property>
  </bean>
  <bean id="transformer.ocr.tiff" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer" lazy-init="default" autowire="default" dependency-check="default">
    <property name="worker">
      <ref bean="transformer.worker.ocr.tiff" />
    </property>
  </bean>

</beans>

These are the transformation files to write, one for each file format you want to OCR with Tesseract.
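Since the three context files are near copies of each other, it is easy to leave the same bean id in two of them. Spring identifies beans by id, so a later definition with the same id will typically replace the earlier one, and one format then silently loses its transformer. A small Python sketch that flags repeated ids; the function name `find_duplicate_bean_ids` is made up for this illustration, and it only looks at `<bean>` elements that carry an `id` attribute:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List


def find_duplicate_bean_ids(contexts: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each bean id that is declared more than once to the files declaring it.

    contexts maps a file name to the XML text of that context file.
    """
    seen = defaultdict(list)
    for filename, xml_text in contexts.items():
        root = ET.fromstring(xml_text)
        # iter() also visits nested beans; anonymous inner beans have no id and are skipped.
        for bean in root.iter("bean"):
            bean_id = bean.get("id")
            if bean_id:
                seen[bean_id].append(filename)
    return {bean_id: files for bean_id, files in seen.items() if len(files) > 1}


if __name__ == "__main__":
    # Tiny inline samples standing in for two context files.
    first = '<beans><bean id="transformer.ocr.tiff"/></beans>'
    second = '<beans><bean id="transformer.ocr.tiff"/></beans>'
    print(find_duplicate_bean_ids({"first.xml": first, "second.xml": second}))
```

To run it against the real files, read each *-context.xml from the extension directory as text and pass the contents in the dictionary. An empty result means every id is unique.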

5) alfresco-tesseract-search.jar
Download this jar from this link (https://docs.google.com/file/d/0B94FD2QmPSJCNHpuUVlicW95UjA/edit) and place it in "C:\Alfresco\tomcat\lib".

6) ocrtransform.log
Create an empty file named ocrtransform.log in "C:\TMP".

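Then restart Alfresco.

After that, upload files in the image formats above. Alfresco indexes the text extracted from each image, so the file contents become searchable.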
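As for checking whether the transformer is actually being applied: the batch script appends a "from ... to ..." line to C:\tmp\ocrtransform.log on every call, so new lines appearing there after an upload suggest that Alfresco did invoke it. A short Python sketch for reading those lines; the exact line format is an assumption based on the `echo` statement in the script, the function name `parse_ocr_log` is made up, and the paths in the sample are invented:

```python
import re
from typing import List, Tuple

# Matches lines written by: echo from %1 to %2
# The quotes are optional because they depend on how the arguments were passed.
_LOG_LINE = re.compile(r'^from\s+"?(.+?)"?\s+to\s+"?(.+?)"?\s*$')


def parse_ocr_log(text: str) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for every transform call found in the log text."""
    pairs = []
    for line in text.splitlines():
        match = _LOG_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = 'from "C:\\Alfresco\\tomcat\\temp\\src_1.tif" to "C:\\Alfresco\\tomcat\\temp\\tgt_1.txt" \n'
    for source, target in parse_ocr_log(sample):
        print(source, "->", target)
```

If the log stays empty after an upload, the transformer was most likely never called, which points to the context files or the bean registration rather than to Tesseract itself.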