Question

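I am using the code below to read Japanese text. Even with Unicode conversion, it produces garbage characters. Can you guide me on how to get correct output?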

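#include <cstdio>
#include <cstdlib>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

void Test(char* imagePath)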
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with Japanese ("jpn"), reading tessdata from D:\tessdata
    if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
    {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead(imagePath);
    if (image == NULL)
    {
        fprintf(stderr, "Could not read image: %s\n", imagePath);
        exit(1);
    }
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used object and release memory
    api->End();
    delete api;
    delete[] outText;
    pixDestroy(&image);
}

I am using the test data from the link below: https://github.com/tesseract-ocr/tessdata

Test image

Why does Tesseract produce garbage output for Japanese?
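
One way to narrow this down is to check whether the garbage comes from the OCR result itself or only from how it is displayed. GetUTF8Text() returns UTF-8 bytes, and a Windows console that is not using a UTF-8 code page will show those bytes as garbage even when the recognition is correct. The following is a minimal sketch, not a confirmed fix: it writes the bytes unchanged to a file that can be opened in a UTF-8 aware editor. WriteUtf8File and ReadFileBytes are hypothetical helper names and are not part of Tesseract or Leptonica.

```cpp
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not a Tesseract API): writes the UTF-8 bytes
// to a file in binary mode, so no code-page or newline conversion occurs.
bool WriteUtf8File(const std::string& path, const char* utf8Text)
{
    std::ofstream out(path, std::ios::binary);
    if (!out)
    {
        return false;
    }
    out.write(utf8Text, static_cast<std::streamsize>(std::strlen(utf8Text)));
    return out.good();
}

// Hypothetical helper: reads a file back as raw bytes, for checking the result.
std::string ReadFileBytes(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

In Test above, this could be called as WriteUtf8File("ocr_output.txt", outText) before outText is freed (the file name is an assumption). If the file shows correct Japanese, the problem is the console encoding rather than Tesseract; if the file is also garbage, the cause lies elsewhere, for example in the image or the trained data.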

0 answers: