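My task is to extract text from a scanned document/JPG and then pick out only the 6 values mentioned below, so that the form on the next screen/activity can be filled in automatically.
I used the Google Cloud Vision API in my Android app with the Blaze plan (paid). The result is blocks of text, but I only want to extract some of the information from them. How can I do that?
The bills or receipts can differ every time, but I want to extract the same 6 things from the text blocks of every invoice, e.g.:
Is there any tool or third-party library available that I can use in my Android development for this?
Note: I don't think a sample receipt or bill image is needed, because it can be any type of bill or invoice; we only need to pull the 6 mentioned items out of the extracted text.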
Answer 0 (score: 0)
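In the scenario below I create two dummy bill formats and then write an algorithm to parse them. I only write the algorithm, because I don't know Java.
The first column holds pictures of the two bills; the second column holds the text data obtained from the OCR software. It is just a plain text file with no logic implemented, but we know certain keywords that can give it meaning. Below is an algorithm that converts this meaningless file into a logically structured JSON object.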
// Text obtained from BILL format 1
// (a template literal is used because a quoted string cannot span several lines)
var TEXT_FROM_OCR = `Invoice no 12 Amount 55$
Vendor name BusinessTest 1 Account No 1213113
Due date 2019-12-07
Description Lorem ipsum dolor est`;
// Text obtained from BILL format 2
// (this assignment overrides the previous one; keep only the sample you want to parse)
var TEXT_FROM_OCR = `BusinessTest22
Invoice no 19 Amount 12$
Account 4564544 Due date 2019-12-15
Description
Lorem ipsum dolor est
Another description line
Last description line`;
// This is a JavaScript object literal (JSON plus comments) which describes the logic behind the text
var TEMPLATES = {
  "bill_template_1": {
    "vendor": {
      "line_no_start": null, // Unknown, so the line-based search ignores this field
      "line_no_end": null, // Unknown, so the line-based search ignores this field
      "start_delimiter": "Vendor name", // The searched value starts immediately after this delimiter
      "end_delimiter": "Account", // The searched value ends just before this delimiter
      "value_found": null // Save here the value we found
    },
    "account": {
      "line_no_start": null, // Unknown, so the line-based search ignores this field
      "line_no_end": null, // Unknown, so the line-based search ignores this field
      "start_delimiter": "Account No", // The searched value starts immediately after this delimiter
      "end_delimiter": null, // Extract everything until the end of the current line
      "value_found": null // Save here the value we found
    },
    "description": {
      // apply same logic as above
    },
    "due_date": {
      // apply same logic as above
    },
    "invoice_number": {
      // apply same logic as above
    },
    "amount": {
      // apply same logic as above
    }
  },
  "bill_template_2": {
    "vendor": {
      "line_no_start": 0, // Extract data starting at line zero
      "line_no_end": 0, // Extract data up to and including line zero
      "start_delimiter": null, // Ignored, because the value is a complete line
      "end_delimiter": null, // Ignored, because the value is a complete line
      "value_found": null // Save here the value we found
    },
    "account": {
      "line_no_start": null, // Unknown, so the line-based search ignores this field
      "line_no_end": null, // Unknown, so the line-based search ignores this field
      "start_delimiter": "Account", // The searched value starts immediately after this delimiter
      "end_delimiter": "Due date", // The searched value ends just before this delimiter
      "value_found": null // Save here the value we found
    },
    "description": {
      "line_no_start": 4, // Zero-based: the "Description" heading is line 3, so the text starts at line 4
      "line_no_end": 99999, // Extract data until line 99999 (a very big number which means EOF)
      "start_delimiter": null, // Ignored, because the value is made of complete lines
      "end_delimiter": null, // Ignored, because the value is made of complete lines
      "value_found": null // Save here the value we found
    },
    "due_date": {
      // apply same logic as above
    },
    "invoice_number": {
      // apply same logic as above
    },
    "amount": {
      // apply same logic as above
    }
  }
};
// ALGORITHM
// 1. Convert the TEXT_FROM_OCR variable into an array (each index is one line of the file).
//    The regular expression accepts both "\r\n" and "\n" line endings.
var LINES = TEXT_FROM_OCR.split(/\r?\n/);
var MAXIMUM_SCORE = 6; // we are looking to extract 6 values, out of 6
for (const [TEMPLATE_TO_PARSE, PARSE_METADATA] of Object.entries(TEMPLATES)) {
  let SCORE = 0; // for each field we find, we increment the score
  for (const [SEARCHED_FIELD_NAME, DELIMITERS_METADATA] of Object.entries(PARSE_METADATA)) {
    // Search by line first ("!= null" also skips fields whose rules are not filled in yet)
    if (DELIMITERS_METADATA['line_no_start'] != null && DELIMITERS_METADATA['line_no_end'] != null) {
      // Never read past the last line, even when line_no_end is 99999
      const LAST_LINE_NO = Math.min(DELIMITERS_METADATA['line_no_end'], LINES.length - 1);
      // Concatenate the lines defined by the template
      const VALUE = LINES.slice(DELIMITERS_METADATA['line_no_start'], LAST_LINE_NO + 1).join(' ').trim();
      if (VALUE !== '') {
        // We have found a good value, increment the score
        DELIMITERS_METADATA['value_found'] = VALUE;
        SCORE++;
      }
      // Continue to the next field
      continue;
    }
    // Search by text delimiters
    if (DELIMITERS_METADATA['start_delimiter'] != null) {
      // Search for the start delimiter inside each line of the file
      for (const LINE_CONTENT of LINES) {
        const DELIMITER_POSITION = LINE_CONTENT.indexOf(DELIMITERS_METADATA['start_delimiter']);
        if (DELIMITER_POSITION === -1) continue; // not on this line
        // The value starts at the offset we found plus the length of the start delimiter
        const START_POSITION = DELIMITER_POSITION + DELIMITERS_METADATA['start_delimiter'].length;
        // By default we extract everything from START_POSITION until the end of the current line
        let END_POSITION = LINE_CONTENT.length;
        // However, if an end delimiter is defined and found after the start, we use its offset
        if (DELIMITERS_METADATA['end_delimiter'] != null) {
          const END_DELIMITER_POSITION = LINE_CONTENT.indexOf(DELIMITERS_METADATA['end_delimiter'], START_POSITION);
          if (END_DELIMITER_POSITION > -1) END_POSITION = END_DELIMITER_POSITION;
        }
        // Extract the value; substring() takes an end index, whereas substr() would treat it as a length
        DELIMITERS_METADATA['value_found'] = LINE_CONTENT.substring(START_POSITION, END_POSITION).trim();
        // We have found a value, increment the score
        SCORE++;
        // Stop scanning lines and move on to the next field
        break;
      }
    }
  }
  console.log(TEMPLATE_TO_PARSE + ' obtained a score of ' + SCORE + ' out of ' + MAXIMUM_SCORE);
}
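To make the delimiter step easier to check in isolation, here is a minimal sketch of it as a standalone function. The name `extractBetween` is my own invention for illustration and is not part of any library. It assumes the value sits on a single line and that the end delimiter, when given, appears after the start delimiter.

```javascript
// Hypothetical helper (illustration only): returns the text between
// startDelimiter and endDelimiter on one line, or null when the start
// delimiter is not on that line.
function extractBetween(line, startDelimiter, endDelimiter) {
  const startIndex = line.indexOf(startDelimiter);
  if (startIndex === -1) return null;
  // The value begins right after the start delimiter.
  const valueStart = startIndex + startDelimiter.length;
  // Default: read until the end of the line.
  let valueEnd = line.length;
  if (endDelimiter != null) {
    // Only look for the end delimiter after the start of the value.
    const endIndex = line.indexOf(endDelimiter, valueStart);
    if (endIndex !== -1) valueEnd = endIndex;
  }
  // substring(start, end) takes an end index, unlike substr(start, length).
  return line.substring(valueStart, valueEnd).trim();
}

const sampleLine = "Vendor name BusinessTest 1 Account No 1213113";
console.log(extractBetween(sampleLine, "Vendor name", "Account")); // BusinessTest 1
console.log(extractBetween(sampleLine, "Account No", null));       // 1213113
```

Searching for the end delimiter only from `valueStart` onwards avoids a negative-length slice when the same keyword also appears earlier on the line.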
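In the end you will know which template extracted the most data and, based on that, which template's data to use for that bill. Feel free to ask questions in the comments. If I spent 45 minutes writing this answer, I will certainly answer your comments too. :)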
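Picking the winning template could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper names `applyTemplate` and `pickBestTemplate`, the template names, and the delimiter rules for the fields that the answer leaves as "apply same logic as above" (I guessed them from the two sample texts). On a tie, the first template wins.

```javascript
// Hypothetical helper: applies one template to the OCR lines and returns
// the extracted values together with the number of fields found.
function applyTemplate(lines, template) {
  const values = {};
  let score = 0;
  for (const [field, rule] of Object.entries(template)) {
    let value = "";
    if (rule.line_no_start != null && rule.line_no_end != null) {
      // Line-based rule: join the requested lines, clamped to the end of the text.
      const last = Math.min(rule.line_no_end, lines.length - 1);
      value = lines.slice(rule.line_no_start, last + 1).join(" ");
    } else if (rule.start_delimiter != null) {
      // Delimiter-based rule: use the first line that contains the start delimiter.
      for (const line of lines) {
        const at = line.indexOf(rule.start_delimiter);
        if (at === -1) continue;
        const from = at + rule.start_delimiter.length;
        const to = rule.end_delimiter != null ? line.indexOf(rule.end_delimiter, from) : -1;
        value = line.substring(from, to === -1 ? line.length : to);
        break;
      }
    }
    value = value.trim();
    if (value !== "") { // empty results do not count towards the score
      values[field] = value;
      score++;
    }
  }
  return { values, score };
}

// Hypothetical helper: runs every template and keeps the highest score.
function pickBestTemplate(ocrText, templates) {
  const lines = ocrText.split(/\r?\n/).map((line) => line.trim());
  let best = null;
  for (const [name, template] of Object.entries(templates)) {
    const result = applyTemplate(lines, template);
    if (best === null || result.score > best.score) best = { name, ...result };
  }
  return best;
}

// Assumed rules for all six fields, derived from the two sample bills.
const SAMPLE_TEMPLATES = {
  format_1: {
    vendor: { start_delimiter: "Vendor name", end_delimiter: "Account" },
    account: { start_delimiter: "Account No", end_delimiter: null },
    invoice_number: { start_delimiter: "Invoice no", end_delimiter: "Amount" },
    amount: { start_delimiter: "Amount", end_delimiter: null },
    due_date: { start_delimiter: "Due date", end_delimiter: null },
    description: { start_delimiter: "Description", end_delimiter: null },
  },
  format_2: {
    vendor: { line_no_start: 0, line_no_end: 0 },
    account: { start_delimiter: "Account", end_delimiter: "Due date" },
    invoice_number: { start_delimiter: "Invoice no", end_delimiter: "Amount" },
    amount: { start_delimiter: "Amount", end_delimiter: null },
    due_date: { start_delimiter: "Due date", end_delimiter: null },
    description: { line_no_start: 4, line_no_end: 99999 },
  },
};

const OCR_TEXT = [
  "BusinessTest22",
  "Invoice no 19 Amount 12$",
  "Account 4564544 Due date 2019-12-15",
  "Description",
  "Lorem ipsum dolor est",
  "Another description line",
  "Last description line",
].join("\n");

const best = pickBestTemplate(OCR_TEXT, SAMPLE_TEMPLATES);
console.log(best.name, best.score); // format_2 6
console.log(best.values);
```

The `values` object of the winner is what you would hand to the next screen to pre-fill the form. A score below 6 tells you which fields still need manual input.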