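I want to use Tesseract OCR from C# in Visual Studio to read the text in a rectangular region of the screen.
First, what do I need in order to use Tesseract from C# in Visual Studio? I am new to Visual Studio and to setting up wrappers. After several hours of searching on Google, I found that I need a wrapper (charlesw) and a language pack from the official website. Do I also need to install the Windows tesseract-ocr package?
I have followed the steps on charlesw's GitHub page to set up the wrapper in my project. However, I am still not sure how to use its functions.
I assume this is how to declare the OCR engine: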
TesseractEngine engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);
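To analyze a rectangular region of the screen, I could capture that region and save it as a .bmp or .tif file, then have the engine analyze the image:
engine.[unknownapi](imagepath); //what is the api name going to be? I tried to look it up [here][2].
Alternatively, some people say this can be done through Tesseract's API, which accepts the coordinates of the rectangular region.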
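Answer 0 (score: 2)
The wrapper bundles the Tesseract DLL (as libtesseract302.dll). You do not need to install the Windows tesseract-ocr package; in fact, you should not, because it can interfere with the wrapper.
You can specify a region of interest on an image with either of the following methods: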
engine.Process(Bitmap image, Rect region, PageSegMode? pageSegMode = null)
or
engine.Process(Pix image, Rect region, PageSegMode? pageSegMode = null)
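For the screen-capture case in the question, the Bitmap overload above can be combined with `Graphics.CopyFromScreen` from System.Drawing. The following is only a minimal sketch under a few assumptions: the class and method names (`ScreenOcr`, `ReadScreenRegion`) are made up for illustration, the wrapper version in use still exposes the `Process(Bitmap, Rect, ...)` overload, and the project references System.Drawing. Because the captured bitmap already is the region of interest, the `Rect` passed to `Process` simply covers the whole bitmap.

```csharp
using System.Drawing;
using Tesseract;

public static class ScreenOcr
{
    // Hypothetical helper: copies one screen rectangle into a bitmap
    // and returns whatever text the engine recognizes in it.
    public static string ReadScreenRegion(TesseractEngine engine, Rectangle screenArea)
    {
        using (var bitmap = new Bitmap(screenArea.Width, screenArea.Height))
        {
            using (var graphics = Graphics.FromImage(bitmap))
            {
                // Source point is in screen coordinates, destination is the bitmap origin.
                graphics.CopyFromScreen(screenArea.Location, Point.Empty, screenArea.Size);
            }

            // The bitmap holds only the captured region, so process all of it.
            var wholeBitmap = new Rect(0, 0, bitmap.Width, bitmap.Height);
            using (var page = engine.Process(bitmap, wholeBitmap))
            {
                return page.GetText();
            }
        }
    }
}

// Possible usage (the screen coordinates are arbitrary example values):
// using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
// {
//     string text = ScreenOcr.ReadScreenRegion(engine, new Rectangle(100, 100, 400, 200));
// }
```

Capturing only the needed rectangle avoids writing a .bmp or .tif file to disk first. If a full-screen capture is already available, passing a smaller `Rect` to `Process` should achieve the same effect.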
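Answer 1 (score: 0)
Here is my process. I first had to rasterize a PDF (which may not be a requirement for you).
1.) Install Ghostscript 9.26 (from here). Later versions do not work with the next step.
2.) Install the Ghostscript.NET NuGet package: Install-Package Ghostscript.NET -Version 1.2.1
3.) Install the Tesseract NuGet package: Install-Package Tesseract -Version 3.3.0
Here is my PDF rasterization routine using Ghostscript.NET: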
public static List<MemoryStream> GetPdfImages(FileInfo pdfFile, DirectoryInfo workingDir, string fileNamingToken, TextWriter _logger)
{
int desired_x_dpi = 150;
int desired_y_dpi = 150;
string inputPdfPath = pdfFile.FullName;
var streams = new List<MemoryStream>();
using (var rasterizer = new GhostscriptRasterizer())
{
GhostscriptVersionInfo gsVersionInfo = GhostscriptVersionInfo.GetLastInstalledVersion(GhostscriptLicense.GPL | GhostscriptLicense.AFPL, GhostscriptLicense.GPL);
try
{
rasterizer.Open(inputPdfPath, gsVersionInfo, true);
}
catch (Ghostscript.NET.GhostscriptAPICallException exc)
{
_logger.WriteLine("There is an issue with this version of Ghostscript or how Ghostscript was installed. As of Winter 2020, GS 9.26 will work the best with Ghostscript.NET");
_logger.WriteLine(exc.Message);
return streams; //the PDF could not be opened, so there are no pages to render
}
for (var pageNumber = 1; pageNumber <= rasterizer.PageCount; pageNumber++)
{
var memoryStrm = new MemoryStream();
var img = rasterizer.GetPage(desired_x_dpi, desired_y_dpi, pageNumber);
//save to a memory stream to be returned
img.Save(memoryStrm, System.Drawing.Imaging.ImageFormat.Tiff);
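//or save to the file system to see how well it's working (pass the format explicitly so the file really is a TIFF)
img.Save(Path.Combine(workingDir.FullName, $"{fileNamingToken}_{pageNumber}.TIF"), System.Drawing.Imaging.ImageFormat.Tiff);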
_logger.WriteLine($"Image Dimensions: {img.Width} x {img.Height}");
streams.Add(memoryStrm);
}
}
return streams;
}
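The DPI chosen when rasterizing determines the pixel coordinates of any page region that is OCR'd afterwards. As a rough sketch of that arithmetic (the `PageRegions` class and its method names are hypothetical, not part of Ghostscript.NET or the Tesseract wrapper), the pixel size is the length in inches multiplied by the DPI:

```csharp
using System;

public static class PageRegions
{
    // Converts a length in inches to whole pixels at the given resolution.
    public static int InchesToPixels(double inches, int dpi)
    {
        return (int)Math.Round(inches * dpi);
    }

    // Returns the corner coordinates (x1, y1, x2, y2) of the top half of a page.
    public static (int X1, int Y1, int X2, int Y2) TopHalf(double widthInches, double heightInches, int dpi)
    {
        int width = InchesToPixels(widthInches, dpi);
        int height = InchesToPixels(heightInches, dpi);
        return (0, 0, width, height / 2);
    }
}
```

For an 8.5 x 11 inch page at 150 DPI this gives a 1275 x 1650 pixel image, so the top half ends at x = 1275, y = 825. Those four values could be passed to `Rect.FromCoords`; if the rasterization DPI changes, the rectangle has to be recomputed.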
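Once I have the list of memory streams, I loop over them and use Tesseract to OCR a rectangle on each one. If you have many files to process, you should not create the engine over and over; create it once somewhere else and reuse it.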
var _engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default, "letters");
var topHalfPageRect = Rect.FromCoords(1, 1, 1275, 825);//at 150 DPI, get top of 8.5x11 page
for(int i =0;i< _streams.Count;i++)
{
var imgStm = _streams[i];//my list of memory streams created by Ghostscript 9.26
imgStm.Position = 0;//set memorystream playhead back to start
using (var imageWithText = Pix.LoadTiffFromMemory(imgStm.ToArray()))
{
using (var page = _engine.Process(imageWithText, topHalfPageRect , PageSegMode.SparseText))
{
var text = page.GetText();
var processedText = text.Replace("\n", "").Trim();
Console.WriteLine(processedText);
if (MyRegexPatterns.Pattern1.IsMatch(processedText))
{
Console.WriteLine("*** FOUND IT!! ***");
}
}
}
imgStm.Dispose();//no matter what, dispose of the stream now
}
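The loop above relies on `MyRegexPatterns.Pattern1`, which is not shown in the answer. As an illustration only, the text clean-up and matching step might look like the sketch below. The `OcrText` class, its `Normalize` method, and the invoice-number pattern are all hypothetical stand-ins. One deliberate difference from the loop: line breaks are replaced with a single space instead of being removed, so words on adjacent lines are not glued together before matching.

```csharp
using System.Text.RegularExpressions;

public static class OcrText
{
    // Hypothetical pattern: an invoice number such as "INV-004217".
    public static readonly Regex InvoiceNumber =
        new Regex(@"\bINV-\d{6}\b", RegexOptions.Compiled);

    // Collapses every run of whitespace (including line breaks) into one space
    // and trims the ends, so a pattern can match across OCR line breaks.
    public static string Normalize(string raw)
    {
        return Regex.Replace(raw ?? string.Empty, @"\s+", " ").Trim();
    }
}
```

Inside the loop, the result of `page.GetText()` could be passed through `OcrText.Normalize` and then tested with `OcrText.InvoiceNumber.IsMatch(...)`. Whether joining lines with a space or with nothing works better depends on how the target text wraps in the scanned document.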