Question

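I am using iText's Java text extraction to read text from PDF files. The code below works for English-language PDFs. Now I have a PDF whose data is stored as images, and I want to read the data from those images.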

/**
 * Generates RSA keys.
 */
private void generateRsaKeys(Context context, String rsaAlias) {
    try {
        // Set English locale as default (workaround)
        Locale initialLocale = Locale.getDefault();
        setLocale(Locale.ENGLISH);
        // Generate the RSA key pairs
        Calendar start = Calendar.getInstance();
        Calendar end = Calendar.getInstance();
        end.add(Calendar.YEAR, 30); // 30 years
        KeyPairGeneratorSpec spec = new KeyPairGeneratorSpec.Builder(context)
                .setAlias(rsaAlias)
                .setSubject(new X500Principal("CN=" + rsaAlias + ", O=Organization"))
                .setSerialNumber(BigInteger.TEN)
                .setStartDate(start.getTime())
                .setEndDate(end.getTime())
                .build();
        KeyPairGenerator kpg = KeyPairGenerator.getInstance(RSA, ANDROID_KEY_STORE);
        kpg.initialize(spec);
        kpg.generateKeyPair();
        // Reset default locale
        setLocale(initialLocale);
    } catch (NoSuchAlgorithmException | NoSuchProviderException | InvalidAlgorithmParameterException e) {
        Log.e(e, "generateRsaKeys: ");
    }
}

/**
 * Sets default locale.
 */
private void setLocale(Locale locale) {
    Locale.setDefault(locale);
    Resources resources = context.getResources();
    Configuration config = resources.getConfiguration();
    config.locale = locale;
    resources.updateConfiguration(config, resources.getDisplayMetrics());
}

Answer 1

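You can implement an OCR workflow with iText. As Amedee already hinted, this is something we have tried at iText, and the results were promising.

Algorithm (high level):

Implement IEventListener to parse the pages of the document
Watch for ImageRenderInfo events, which are triggered whenever the PDF parser hits an image
On that event you can call getImage() and eventually obtain a BufferedImage
Feed the BufferedImage to Tesseract
Apply a coordinate transformation (Tesseract does not use the same coordinate space as iText)
Now that you have the text and its position within the image, you can use iText to overlay the text on the PDF, or simply extract it.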
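
The coordinate transformation step is the easiest one to get wrong, so here is a minimal sketch of it in plain Java with no iText or Tesseract dependency. It assumes the OCR engine reports bounding boxes in image pixels with the origin at the top-left and y pointing down, and that you have the six values {a, b, c, d, e, f} of the matrix that places the image on the page (in iText 7 this would presumably come from the image render event). The class and method names (OcrCoordinates, pixelToPage, boxToPage) are made up for illustration.

```java
/** Hypothetical helper: maps OCR pixel coordinates into PDF page coordinates. */
class OcrCoordinates {

    /**
     * Maps one pixel position to page space.
     * ctm = {a, b, c, d, e, f} is assumed to be the matrix that maps the
     * unit square (where PDF draws every image) onto the page.
     */
    static double[] pixelToPage(double px, double py, int imgW, int imgH, double[] ctm) {
        // Normalise to the unit square; PDF puts the image's top row at v = 1,
        // so the y axis has to be flipped.
        double u = px / imgW;
        double v = 1.0 - py / imgH;
        double x = ctm[0] * u + ctm[2] * v + ctm[4];
        double y = ctm[1] * u + ctm[3] * v + ctm[5];
        return new double[] {x, y};
    }

    /**
     * Maps a pixel bounding box (left, top, width, height) to an axis-aligned
     * page rectangle {minX, minY, maxX, maxY}. All four corners are transformed
     * so that rotated or skewed images are still covered.
     */
    static double[] boxToPage(int left, int top, int width, int height,
                              int imgW, int imgH, double[] ctm) {
        double[][] corners = {
            pixelToPage(left, top, imgW, imgH, ctm),
            pixelToPage(left + width, top, imgW, imgH, ctm),
            pixelToPage(left, top + height, imgW, imgH, ctm),
            pixelToPage(left + width, top + height, imgW, imgH, ctm)
        };
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] c : corners) {
            minX = Math.min(minX, c[0]);
            maxX = Math.max(maxX, c[0]);
            minY = Math.min(minY, c[1]);
            maxY = Math.max(maxY, c[1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Example values: a 100x50 pixel image drawn 200x100 units large at (10, 20).
        double[] ctm = {200, 0, 0, 100, 10, 20};
        double[] box = boxToPage(25, 10, 50, 20, 100, 50, ctm);
        System.out.println(java.util.Arrays.toString(box));
    }
}
```

The resulting rectangle is in page units with the origin at the bottom-left, which is where you would place the overlaid text.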

Answer 2

iText does not support OCR for extracting text from images. Try Tesseract or another OCR tool.

Answer 3

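If an online solution is acceptable, you can use this online PDF OCR API. The first 3 pages of each document are free.

If you extract the images first, you can also use other OCR APIs.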

Reading data from images in a PDF

3 answers: