Question

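I'm looking for a way to extract data from my invoices so that I can send a summary to my accountant.

Some companies offer this kind of service for about 20 euros per month, and the invoices are usually recognized well. But the services I tried either didn't extract all the data I wanted or lacked features such as Excel export for sending the data to my accountant. And paying 20 euros a month for yet another service, just to manage 5 invoices a month, doesn't appeal to me.

I did some research and found this Stack Overflow question: Can anyone recommend OCR software to process invoices?

It's a bit dated, so I'm hoping for more up-to-date suggestions. I tried Ephesoft Community Edition, which looked very promising at first. But the software has a learning step and a review step, and the data from the review step doesn't seem to be fed back into the learning step. On top of that, it is more hassle than doing the work by hand. I think it is made for large enterprises.

I'm looking for simple data extraction software that learns from each step I show it.

I also looked at Apache Tika, but it doesn't seem ready to use with a simple web interface.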

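1) Do you have any suggestions for a paid OCR service with flexible extraction of VAT amount / VAT % / total amount / total amount currency / VAT currency / account paid from / company name, and with export to Excel?

2) Do you have any suggestions for open-source software?

3) Do you have any general advice on how to handle a small number of invoices (fewer than 50 per year)?

Many thanks,

Toby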

Answer 1

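Besides raw OCR plus regular expressions (which may work fine for some very limited use cases), there are a few other options that offer API access. The ones you can start using without any demo or sales process:

TagGun - specialized for receipts, can also extract line items, 50 receipts per month free
Elis - specialized for invoices, supports a wide range of templates automatically (pretrained machine learning models), free for under 300 invoices per month

If you're willing to go through a sales process (and they actually seem to be real and live):

LucidTech and Itemize (not sure how accurate they are or which fields they extract, as their API details are not public)
FlexiCapture Engine - template-based, if you're willing to define a template for each specific invoice format

(Disclaimer: I'm affiliated with Rossum, the vendor of Elis. Feel free to suggest edits that add other APIs!)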

Answer 2

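If you're looking for free OCR services, there are:

Google Cloud Vision (2,000 free conversions per month, no PDF support)
Microsoft OCR (2,000 conversions per month, no PDF support)
ocr.space (25,000 conversions per month, includes PDF support)

...but they all return only raw OCR data. Maybe you can use a regex to get the data you need? That probably depends on how complex your invoices are.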
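
A minimal sketch of that regex idea, assuming the OCR service has already returned plain text. The sample text, the field names, and the patterns are all made up for illustration; real invoices will need patterns tuned to each layout.

```python
import re

# Hypothetical raw OCR output; real services return text in their own layout.
SAMPLE_OCR_TEXT = """ACME GmbH
Invoice No: 2019-0042
Net amount: 100.00 EUR
VAT 19%: 19.00 EUR
Total: 119.00 EUR
"""

# Illustrative patterns only; each invoice layout needs its own set.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "vat_percent": re.compile(r"VAT\s+(\d+(?:\.\d+)?)\s*%"),
    "vat_amount": re.compile(r"VAT[^:\n]*:\s*(\d+\.\d{2})"),
    "total_amount": re.compile(r"^Total:\s*(\d+\.\d{2})", re.MULTILINE),
    "currency": re.compile(r"^Total:\s*\d+\.\d{2}\s*([A-Z]{3})", re.MULTILINE),
}


def extract_fields(text):
    """Return the first match per field, or None when a pattern finds nothing."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_OCR_TEXT))
```

Fields that come back as None are the ones to check by hand, which is manageable at a few invoices per month.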

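Another approach is Kantu Web Automation. It is browser automation software that can also extract data from PDFs using visual zonal OCR (you mark the areas that contain the data with green and pink boxes). That may work for your invoice OCR use case. Kantu Community Edition is free software.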
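
To illustrate the zonal idea in general terms (this is not Kantu's API, just a sketch): assume an OCR step has produced words with pixel positions, and that you have defined one rectangle per field for a given invoice layout. The `Word` and `Zone` types, the zone coordinates, and the sample words are hypothetical.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word, in pixels
    y: float  # top edge of the word, in pixels


class Zone(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


# Hypothetical zones for one specific invoice layout.
ZONES = {
    "company_name": Zone(0, 0, 300, 40),
    "total_amount": Zone(400, 700, 600, 740),
}

# Hypothetical OCR output: words with their top-left corner.
SAMPLE_WORDS = [
    Word("ACME", 10, 10),
    Word("GmbH", 80, 10),
    Word("Total:", 300, 710),
    Word("119.00", 420, 710),
    Word("EUR", 500, 710),
]


def words_in_zone(words, zone):
    """Join the words whose top-left corner lies inside the zone, in reading order."""
    inside = [
        w for w in words
        if zone.left <= w.x <= zone.right and zone.top <= w.y <= zone.bottom
    ]
    inside.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in inside)


def extract_by_zone(words, zones=ZONES):
    """Map each field name to the text found inside its zone."""
    return {name: words_in_zone(words, zone) for name, zone in zones.items()}


if __name__ == "__main__":
    print(extract_by_zone(SAMPLE_WORDS))
```

This only works while the supplier keeps the same layout, which is the usual trade-off of template-based extraction.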

Answer 3

Sypht provides an API for doing this: http://www.sypht.com.

Python client: https://github.com/sypht-team/sypht-python-client

Step 1

pip install sypht

Step 2

from sypht.client import SyphtClient, Fieldset

sc = SyphtClient('<client_id>', '<client_secret>')

with open('invoice.png', 'rb') as f:
    fid = sc.upload(f, fieldsets=["document", "invoice"])

print(sc.fetch_results(fid))

Disclaimer: I am affiliated with the vendor.

Answer 4

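Let me add a link to an API that extracts invoice data automatically (using machine learning).

Try it out, there is a demo: https://rossum.ai/developers

The extraction process can now be automated with the API (https://docs.api.rossum.ai/), as shown below: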

from __future__ import division, print_function

import argparse
import json
import os

import requests
import polling

DEFAULT_API_URL='https://all.rir.rossum.ai'

class ElisClient(object):
    """
    Simple client for Rossum Elis API that allows to submit a document for
    extraction and then wait for the processed result.
    Usage:
    ```
    client = ElisClient(secret_key, base_url)
    document_id = client.send_document(document_path)
    extracted_document = client.get_document(document_id)
    ```
    """
    def __init__(self, secret_key, url=DEFAULT_API_URL):
        self.secret_key = secret_key
        self.url = url
        # we do not use requests.auth.HTTPBasicAuth
        self.headers = {'Authorization': 'secret_key ' + self.secret_key}

    def send_document(self, document_path):
        """
        Submits a document to Elis API for extractions.
        Returns: dict with 'id' representing job id
        """
        with open(document_path, 'rb') as f:
            content_type = self._content_type(document_path)
            response = requests.post(
                self.url + '/document',
                files={'file': (os.path.basename(document_path), f, content_type)},
                headers=self.headers)
        return json.loads(response.text)

    @staticmethod
    def _content_type(document_path):
        return 'image/png' if document_path.lower().endswith('.png') else 'application/pdf'

    def get_document_status(self, document_id):
        """
        Gets a single document status.
        """
        response = requests.get(self.url + '/document/' + document_id, headers=self.headers)
        response_json = json.loads(response.text)
        if response_json['status'] != 'ready':
            print(response_json)
        return response_json

    def get_document(self, document_id, max_retries=30, sleep_secs=5):
        """
        Waits for document via polling.
        """
        def is_done(response_json):
            return response_json['status'] != 'processing'

        return polling.poll(
            lambda: self.get_document_status(document_id),
            check_success=is_done,
            step=sleep_secs,
            timeout=int(round(max_retries * sleep_secs)))

def parse_args():
    parser = argparse.ArgumentParser(description='Elis API client example.')
    parser.add_argument('document_path', metavar='DOCUMENT_PATH',
                        help='Document path (PDF/PNG)')
    parser.add_argument('-s', '--secret-key', help='Secret API key')
    parser.add_argument('-u', '--base-url', default=DEFAULT_API_URL, help='Base API URL')

    return parser.parse_args()

def main():
    args = parse_args()
    client = ElisClient(args.secret_key, args.base_url)
    print('Submitting document:', args.document_path)
    send_result = client.send_document(args.document_path)
    document_id = send_result['id']
    print('Document id:', document_id)
    extracted_document = client.get_document(document_id)
    print('Extracted data:')
    print(json.dumps(extracted_document, indent=4))

if __name__ == '__main__':
    main()

Call it like this:

python elis_client_example.py ../data/invoice.pdf -s xxxxxxxxxxxxxxxxxxxxxx_YOUR_ELIS_API_KEY_xxxxxxxxxxxxxxxxxxxxxxx

(Example from https://github.com/rossumai/elis-client-examples/)

For disclosure, I am part of the team that works on providing this support for developers.

Answer 5

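Check out Veryfi. It extracts 50+ fields from receipts and invoices (including line items) in 3-5 seconds.

It works out of the box (i.e. no training needed), gives high-accuracy results, and supports 30+ languages/regions.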

pip install veryfi

from veryfi import Client

# client_id, client_secret, username and api_key are the credentials from your Veryfi account
veryfi_client = Client(client_id, client_secret, username, api_key)

categories = ['Grocery', 'Utilities', 'Travel']  # list of your categories

file_path = '/tmp/invoice.jpg'

response = veryfi_client.process_document(file_path, categories=categories)

print(response)

Here is a detailed overview of how to use it: https://www.veryfi.com/engineering/invoice-data-capture-api/

*I'm a co-founder of Veryfi, so feel free to ask any questions.

Answer 6

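Extracting data from invoices is a complex problem. I have not come across any open-source solution for it. OCR is only one part of the data extraction process. You also need image preprocessing, an AI engine for data recognition, and so on.

There are many solutions to this problem, and each of them is a bit different. @Peter Baudis has already mentioned some of them.

The very simple ones:

OCR SPACE Receipt scanning - extracts data in a table format, but you still have to parse it and work out which part of the text is, for example, the invoice number

More advanced:

Nanonets - many machine learning API solutions (invoices, tax forms, etc.)
typless - a single-call API for any document (invoices, purchase orders, etc.), 50 invoices per month free
Parascript - template system, similar to ABBYY FlexiCapture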

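It is important to know your use case. There is no one-size-fits-all solution. It depends on what you want to achieve:

Data mining - it has to be cheap and fast. Missing or incorrect data is not mission-critical. You can clean it up during data analysis.
Automation in the enterprise - recurring invoices that the system has been trained on must work almost 100% of the time. Speed and new invoices are not mission-critical.
Automation in, for example, customs - as much data as possible must be returned. The accuracy of the whole system matters, but every document will probably be reviewed anyway.

So you should test them and see how they fit your process/needs.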
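
One way to run such a test, as a rough sketch: type in the correct values for a handful of your own invoices, run them through a candidate service, and compare field by field. The document ids, field names, and values below are hypothetical, and exact string equality is an assumption that may be too strict for amounts or dates formatted differently.

```python
def field_accuracy(expected, extracted):
    """Per field, the share of documents where the extracted value equals the expected one."""
    hits = {}
    totals = {}
    for doc_id, truth in expected.items():
        got = extracted.get(doc_id, {})
        for field, value in truth.items():
            totals[field] = totals.get(field, 0) + 1
            if got.get(field) == value:
                hits[field] = hits.get(field, 0) + 1
    return {field: hits.get(field, 0) / totals[field] for field in totals}


# Hypothetical hand-entered ground truth for two invoices.
EXPECTED = {
    "inv-1": {"total_amount": "119.00", "vat_percent": "19"},
    "inv-2": {"total_amount": "59.50", "vat_percent": "19"},
}

# Hypothetical output of the service under test, already mapped to the same field names.
EXTRACTED = {
    "inv-1": {"total_amount": "119.00", "vat_percent": "19"},
    "inv-2": {"total_amount": "59.60", "vat_percent": "19"},
}

if __name__ == "__main__":
    print(field_accuracy(EXPECTED, EXTRACTED))
```

Per-field numbers are more useful than one overall score, because a service can be reliable on totals and still weak on VAT or company names.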

Disclaimer: I am one of the creators of typless. Feel free to suggest edits.

Answer 7

You can try Nanonets. There is an example in this GitHub repository:

https://github.com/NanoNets/invoice-processing-with-python-nanonets

import json

import requests

model_id = "Your Model Id"
api_key = "Your API Key"
image_path = "Invoice Path"

url = 'https://app.nanonets.com/api/v2/ObjectDetection/Model/' + model_id + '/LabelFile/'

# Open the file in a with-block so the handle is closed after the upload
with open(image_path, 'rb') as image_file:
    data = {'file': image_file, 'modelId': ('', model_id)}
    response = requests.post(url, auth=requests.auth.HTTPBasicAuth(api_key, ''), files=data)

print(json.dumps(json.loads(response.text), indent=2))

Automatic invoice data extraction from OCR or PDF