
I am trying to run a complex function that imports an image, performs pre-processing, and then runs pytesseract several times with different parameters.

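The problem is that this function has to run over millions of images, and I think the best solution is to use multiprocessing on a very large EC2 instance (one with many cores).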

There appear to be two ways of doing multiprocessing, Process and map_async; I give the code for both below:

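import pandas as pd

def Extractor(file):
    # Import file
    # Perform pre-processing
    # Run pytesseract x 3 through the image (producing df1, df2, df3)
    # Concatenate the results into a single pandas DataFrame;
    # pd.concat takes one list of frames, not separate list arguments
    result = pd.concat([df1, df2, df3])
    return result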
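
For reference, the map_async variant mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the question: `extract_stub` is a hypothetical stand-in for `Extractor` (so the sketch runs without images or pytesseract), and `pick_chunksize` with its `batches_per_worker` parameter is an untuned heuristic invented for the example.

```python
import multiprocessing as mp


def extract_stub(path):
    """Hypothetical stand-in for Extractor: returns a small picklable result."""
    # A real worker would load the image, pre-process it and call pytesseract.
    return (path, len(path))


def pick_chunksize(n_items, n_workers, batches_per_worker=4):
    """Heuristic (an assumption, not a tuned value): a few batches per worker."""
    return max(1, n_items // (n_workers * batches_per_worker))


def run_map_async(paths, n_workers):
    """Distribute paths over a fixed-size pool with Pool.map_async."""
    chunksize = pick_chunksize(len(paths), n_workers)
    with mp.Pool(processes=n_workers) as pool:
        async_result = pool.map_async(extract_stub, paths, chunksize=chunksize)
        # get() blocks until every chunk has finished; results keep input order.
        return async_result.get()


if __name__ == "__main__":
    sample = ["img_%04d.png" % i for i in range(1000)]  # hypothetical file names
    results = run_map_async(sample, n_workers=mp.cpu_count())
    print(len(results))
```

A pool reuses a fixed number of worker processes, whereas starting one Process per image creates as many processes as there are images. The chunksize argument groups several paths into each task sent to a worker, which may reduce inter-process overhead when the list is very long.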

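Both solutions seem to get worse as the number of samples grows, and their performance approaches that of running the work sequentially. Is some logic missing from this code? Which approach is better for a list of roughly 1 million images?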

Multiprocessing Tesseract for a large number of images

0 answers: