Question

I am using Tesseract to read Japanese text. This is the text I get back from the OCR:

æ—¥ä»〜è«‹æ±，æ›¸

C++ code

 extern "C" _declspec(dllexport) char* _cdecl Test(char* imagePath)
    {
        char *outText;

        tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
        // Initialize tesseract-ocr with Japanese ("jpn"), using the tessdata in D:\tessdata
        if (api->Init("D:\\tessdata", "jpn", tesseract::OcrEngineMode::OEM_TESSERACT_ONLY))
        {
            fprintf(stderr, "Could not initialize tesseract.\n");           
        }

        api->SetPageSegMode(tesseract::PageSegMode::PSM_AUTO);      
        outText = api->GetUTF8Text();

        return outText;
    }

C#

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
        public static extern string Test(string imagePath);

        void Tessrect()
        {
            string result = Test("D:\\japan4.png");
            byte[] bytes = System.Text.Encoding.Default.GetBytes(result);
            MessageBox.Show(System.Text.Encoding.UTF8.GetString(bytes));
        }

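Input file:

The code above works fine on English Windows, but it does not work on Japanese Windows. On a Japanese Windows OS it gives the wrong output.

Can anyone guide me on how to set this up correctly for Japanese Windows?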

Answer 1

outText seems to already be in UTF-8:

outText = api->GetUTF8Text();

Now... returning a byte[] (or something similar) from C++ is a pain... Change the declaration to:

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern IntPtr Test(string imagePath);

Then take StringFromNativeUtf8 from here (because even converting an IntPtr that points to a UTF-8 C string is a pain; .NET itself has nothing built in for that):

void Tessrect()
{
    IntPtr result = IntPtr.Zero;
    string result2;

    try
    {
        result = Test("D:\\japan4.png");
        result2 = StringFromNativeUtf8(result);
    }
    finally
    {
        Free(result);
    }

    MessageBox.Show(result2);
}
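For readers wondering what a helper like StringFromNativeUtf8 has to do, here is a rough sketch of the decoding step, written in C++ rather than C#. The function name Utf8ToUtf16 is made up for illustration and is not part of Tesseract or .NET. It walks a NUL-terminated UTF-8 buffer and produces UTF-16 code units, which is how a .NET string stores text. It replaces malformed sequences with U+FFFD and does not reject overlong encodings, so treat it as an illustration rather than a validated decoder.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: decode a NUL-terminated UTF-8 buffer into UTF-16 code units.
std::u16string Utf8ToUtf16(const char* utf8)
{
    std::u16string out;
    if (utf8 == nullptr) return out;

    const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8);
    while (*p != 0) {
        std::uint32_t cp = 0;
        int extra = 0;  // number of continuation bytes expected

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++p; continue; }  // invalid lead byte
        ++p;

        bool ok = true;
        for (int i = 0; i < extra; ++i) {
            // A continuation byte must look like 10xxxxxx; the NUL terminator fails this test too.
            if ((*p & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (*p & 0x3F);
            ++p;
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(0xFFFD); continue; }

        if (cp >= 0x10000) {
            // Code points outside the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With this sketch, the six bytes E6 97 A5 E4 BB 98 decode to the two characters U+65E5 and U+4ED8, which is the kind of result the C# side needs instead of a string built with the system ANSI code page.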

Then you will have to free the IntPtr... another pain.

[DllImport(DllName, CallingConvention = CallingConvention.Cdecl)]
public static extern void Free(IntPtr ptr);

and

extern "C" _declspec(dllexport) void _cdecl Free(char* ptr)
{
    delete[] ptr;
}
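The reason for exporting a separate Free function is ownership: the Tesseract header documents that the buffer returned by GetUTF8Text() must be released with delete[], and the safest place to do that is the same native module that handed the pointer out. The sketch below shows only that allocate/release pairing, with no Tesseract dependency. MakeText and FreeText are hypothetical names used for illustration; MakeText stands in for any export that returns a new[]-allocated string.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an export that returns a heap buffer allocated with new[],
// the same way a wrapper around GetUTF8Text() would.
extern "C" char* MakeText(const char* source)
{
    const std::size_t len = std::strlen(source);
    char* copy = new char[len + 1];
    std::memcpy(copy, source, len + 1);  // copies the NUL terminator as well
    return copy;
}

// Matching release function: the caller passes the pointer back to the module
// that allocated it, and that module uses the matching delete[].
extern "C" void FreeText(char* ptr)
{
    delete[] ptr;  // delete[] on a null pointer is a no-op
}
```

Because delete[] accepts a null pointer, the C# finally block can call the release function even when the native call threw before assigning a result.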

Answer 2

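You are sending UTF-8 text to a window that is not UTF-8. You need to convert it before you can display it.

This is the code that is probably causing the problem (because it uses the default system encoding, which you have no control over): byte[] bytes = System.Text.Encoding.Default.GetBytes(result);

Have you tried using Encoding.UTF8 there?

If that alone does not work, try also changing Encoding.UTF8 to Encoding.Default in the line below it.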
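To see why the default encoding matters, here is a small C++ sketch of the round trip the question's code depends on. It uses ISO-8859-1 as a simplified stand-in for a single-byte Western code page, which is an assumption made for illustration; the function names are hypothetical. Every byte maps to exactly one character and back, so UTF-8 bytes that are misread this way can still be recovered. A multi-byte code page such as Shift-JIS on Japanese Windows gives no such guarantee, so the same trick can lose data there.

```cpp
#include <string>

// Wrong decoding: treat every byte as one ISO-8859-1 character.
// This is what turns UTF-8 text into mojibake.
std::u32string MisreadAsLatin1(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Inverse of the wrong decoding: one character back to one byte.
// Characters above U+00FF have no byte in this code page and become '?'.
std::string Latin1ToBytes(const std::u32string& text)
{
    std::string out;
    for (char32_t c : text) {
        out.push_back(c <= 0xFF ? static_cast<char>(static_cast<unsigned char>(c)) : '?');
    }
    return out;
}
```

For the three UTF-8 bytes E6 97 A5, MisreadAsLatin1 yields three separate characters (U+00E6, U+0097, U+00A5) instead of one, and Latin1ToBytes restores the original bytes exactly. That lossless round trip is what the question's code relies on, and it is not guaranteed once the system code page changes.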

Answer 3

You must first create an image object from imagePath.

In my case, I do this with a well-known library such as OpenCV. Then use the SetImage function.

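#include <iostream>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

void detectJpn(const cv::Mat& img)
{
    // Create Tesseract object
    tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();

    // Init returns non-zero on failure
    if (ocr->Init(NULL, "jpn", tesseract::OEM_TESSERACT_ONLY))
    {
        std::cerr << "Could not initialize tesseract." << std::endl;
        delete ocr;
        return;
    }

    // Set Page segmentation mode to PSM_AUTO (3)
    ocr->SetPageSegMode(tesseract::PSM_AUTO);

    // Arguments: pixel data, width, height, bytes per pixel, bytes per line
    // (assumes 8-bit channels, so channels() equals bytes per pixel)
    ocr->SetImage((uchar*)img.data, img.cols, img.rows, img.channels(), static_cast<int>(img.step1()));

    // Run Tesseract OCR on image; the result is a UTF-8 buffer owned by the caller
    char *outText = ocr->GetUTF8Text();

    // Print recognized text
    if (outText != nullptr)
    {
        std::cout << outText << std::endl;
    }

    // Destroy used object and release memory
    delete[] outText;
    ocr->End();
    delete ocr;
}


int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        std::cerr << "Usage: " << argv[0] << " <image>" << std::endl;
        return 1;
    }

    cv::Mat img = cv::imread(argv[1], cv::IMREAD_UNCHANGED);
    if (img.empty())
    {
        std::cerr << "Could not read image." << std::endl;
        return 1;
    }

    detectJpn(img);

    return 0;
}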

How to encode Japanese text on a Japanese Windows OS?
