Question

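I am building a business card reader app. I am using Tesseract OCR to extract text from an image of the card. I get all the text printed on the business card, for example: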

Mark Henry (name)
Assistant Professor (job title)
XYZ University (employer)

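But how do I determine which text is the person's name, which is their company, and which is their job title? Is there an algorithm or some other technique for this?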

P.S. The order of the lines above can vary from card to card.

Answer 1

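This is an ideal problem for natural language processing. You can train a classifier to infer that text containing 'Professor', 'Assistant', and so on is more likely to be a job description, while text containing 'Mark', 'Andrew', etc. is likely to be a name. This is fuzzy logic, so the result is a best guess rather than a certainty.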

Example (labeled training and test data, from the TextBlob classifier tutorial) - http://textblob.readthedocs.org/en/latest/classifiers.html

>>> train = [
...     ('I love this sandwich.', 'pos'),
...     ('this is an amazing place!', 'pos'),
...     ('I feel very good about these beers.', 'pos'),
...     ('this is my best work.', 'pos'),
...     ("what an awesome view", 'pos'),
...     ('I do not like this restaurant', 'neg'),
...     ('I am tired of this stuff.', 'neg'),
...     ("I can't deal with this", 'neg'),
...     ('he is my sworn enemy!', 'neg'),
...     ('my boss is horrible.', 'neg')
... ]
>>> test = [
...     ('the beer was good.', 'pos'),
...     ('I do not enjoy my job', 'neg'),
...     ("I ain't feeling dandy today.", 'neg'),
...     ("I feel amazing!", 'pos'),
...     ('Gary is a friend of mine.', 'pos'),
...     ("I can't believe I'm doing this.", 'neg')
... ]
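
To apply the same idea to business card lines, the labels would be something like name, title, and employer instead of pos and neg. The following is only a minimal sketch of a naive Bayes word classifier written with the Python standard library in place of TextBlob. The training lines, the label names, and the helper functions `tokenize`, `train`, and `classify` are all hypothetical and chosen for illustration; a real app would need far more training data and would still produce only a best guess.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (OCR text line, label).
# These examples are assumptions for illustration, not real card data.
TRAIN = [
    ("Mark Henry", "name"),
    ("Andrew Smith", "name"),
    ("Mary Jones", "name"),
    ("Assistant Professor", "title"),
    ("Senior Software Engineer", "title"),
    ("Sales Manager", "title"),
    ("XYZ University", "employer"),
    ("Acme Corporation", "employer"),
    ("Globex Inc", "employer"),
]


def tokenize(text):
    """Lowercase the line and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label; returns (word_counts, label_counts, vocab)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_lines = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior: how often this label appears in the training data.
        score = math.log(count / total_lines)
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)

# Classify each OCR line independently, so the order on the card does not matter.
card_lines = ["Stanford University", "Associate Professor", "Mark Taylor"]
labels = {line: classify(model, line) for line in card_lines}
```

Because each line is scored on its own, the changing order of lines mentioned in the question does not affect the result. TextBlob's NaiveBayesClassifier accepts the same list of (text, label) pairs, so the hand-written functions could be swapped for it once the approach looks promising.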
