How do I bundle Tika Python with Tesseract OCR?

Time: 2017-04-27 08:28:52

Tags: python apache ocr tesseract apache-tika

Tesseract works perfectly when I call it from the terminal:

tesseract 1.jpg outPutFileHere -l fra

But I am trying to get it to work with Tika. With the same image of text, I get no result from Tika :( Do you know what is going on?

Thanks

1 answer:

Answer 0: (score: 0)

You need to provide a header named "X-Tika-OCRLanguage", for example:

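from tika import parser

# Ask Tika to run Tesseract with the English and Norwegian language packs.
# path is the file to parse, for example "1.jpg".
headers = {
    "X-Tika-OCRLanguage": "eng+nor"
}
parsed = parser.from_file(path, headers=headers)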
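
For the French image in the question, the same header would presumably carry "fra" instead of "eng+nor". The sketch below wraps that idea in two helpers. The names ocr_headers and extract_text are hypothetical, made up for this illustration and not part of tika-python. It assumes the tika package and Java are installed, and that the matching Tesseract language data is available on the machine running the Tika server.

```python
def ocr_headers(languages):
    """Build the request headers that select Tesseract language packs.

    `languages` is a list of Tesseract language codes such as ["fra"]
    or ["eng", "nor"]; Tesseract joins multiple codes with "+".
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return {"X-Tika-OCRLanguage": "+".join(languages)}


def extract_text(path, languages):
    """Parse `path` with Tika and return the extracted text (may be empty).

    Hypothetical helper: the import is done lazily so that building
    headers does not require tika or Java to be installed.
    """
    from tika import parser

    parsed = parser.from_file(path, headers=ocr_headers(languages))
    # The "content" entry can be None when nothing was extracted.
    return parsed.get("content") or ""
```

Calling extract_text("1.jpg", ["fra"]) would then mirror the terminal command from the question. If the result is still empty, it may be worth checking that the Tika server can find the tesseract executable and the requested language data.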