Question

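I have a managed C++ DLL that I use from a C# application. The DLL processes a large number of images (thousands) and extracts text from them using OCR. I know OCR processing is CPU-intensive, but I was wondering whether the code can be optimized for better performance.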

At present it takes about a minute to parse roughly 15 PNG pages. I would like to get that down to around 30-40 seconds.

C++ code:

        char* OCRWrapper::GetUTF8Text(char* path, char* lang, char* imgPath)
        {
            char* imageText;
            tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();

            if (api->Init(path, lang)) {
                fprintf(stderr, "Could not initialize tesseract. Incorrect datapath or incorrect language\n"); /*This should throw an error to the caller*/
                exit(1);
            }

            /*Open a reference to the imagepath*/
            Pix *image = pixRead(imgPath);

            /*Read the image object;*/
            api->SetImage(image);

            // Get OCR result
            imageText = api->GetUTF8Text();

            /*writeToFile(outText);*/
            /*printf("OCR output:\n%s", imageText);*/

            /*Destroy the text*/
            api->End();

            pixDestroy(&image);
            /*std::string x = std::string(imageText);*/

            return imageText;
        }

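The C# method that creates an instance of the OCRObject class (OCRObject is the class that actually calls the DLL):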

    private void GetTextFromSavedImages(List<string> imagesPath)
    {
        try
        {
            StringBuilder allPagesText = new StringBuilder();
            OCRObject ocr = new OCRObject(this.dbHandler.GetApplicationSetting(this.m_ProfileName, "TesseractLanguage").ApplicationSettingValue, this.dbHandler.GetApplicationSetting(this.m_ProfileName, "TesseractConfigurationDataPath").ApplicationSettingValue); //Settings.Default.TesseractConfigurationDataPath
            for (int i = 0; i < imagesPath.Count; i++)
            {

                string pageText = ocr.GetOCRText(imagesPath[i]);
                this.m_pdfDictionary.Add(i + 1, pageText);
                allPagesText.Append(pageText);
            }
            this.AllPageText = allPagesText.ToString();
        }
        catch (Exception ex)
        {
            Logger.Log(ex.ToString(), LogInformationType.Error);
        }
    }

Please let me know if you need more details.

Answer 1

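The Tesseract FAQ advises people to run its executable in parallel (which implies that it is single-threaded).

You could try replacing your for loop with Parallel.For and see whether that gives you a quick-and-dirty win.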
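As a rough illustration of that suggestion, here is a minimal sketch of the loop rewritten with Parallel.For. It is not the asker's actual code: `ParallelOcrSketch`, `RecognizeAll`, and the `recognize` delegate are hypothetical names, with `recognize` standing in for `ocr.GetOCRText`. It assumes that call is safe to invoke from several threads at once, which would need to be checked against the real DLL. Results are written into an array slot per page and merged afterwards, because `Dictionary` and `StringBuilder` are not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class ParallelOcrSketch
{
    // "recognize" is a stand-in for ocr.GetOCRText. Assumption: it can be
    // called concurrently from multiple threads.
    public static string[] RecognizeAll(IList<string> imagePaths, Func<string, string> recognize)
    {
        var pageTexts = new string[imagePaths.Count];

        // Each iteration writes only to its own array slot, so no locking is needed.
        Parallel.For(0, imagePaths.Count, i =>
        {
            pageTexts[i] = recognize(imagePaths[i]);
        });

        return pageTexts;
    }

    public static void Main()
    {
        var paths = new List<string> { "page1.png", "page2.png", "page3.png" };

        // Dummy recognizer so the sketch runs without Tesseract installed.
        string[] texts = RecognizeAll(paths, p => "text of " + p);

        // Merge sequentially, in page order, after the parallel work is done.
        var pdfDictionary = new Dictionary<int, string>();
        var allPagesText = new StringBuilder();
        for (int i = 0; i < texts.Length; i++)
        {
            pdfDictionary.Add(i + 1, texts[i]);
            allPagesText.Append(texts[i]);
        }

        Console.WriteLine(allPagesText.ToString());
    }
}
```

Keeping the merge step outside the parallel loop also preserves the page order, which the original sequential loop guaranteed implicitly.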

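Edit: They have moved to GitHub, and the new FAQ hints:

If you are using Tesseract to generate single-page PDFs, you will get better results by processing the files in parallel and then splicing them together at the end.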

C# managed code optimization
