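I am writing a Python program for a quiz answer bot (for educational purposes only) using Tesseract OCR and the Google Custom Search API. The program seems quite accurate on direct questions ("who did what", "what is this"), but it struggles with questions whose wording depends on the answer options themselves ("which of these").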
import pytesseract
from PIL import Image
from googleapiclient.discovery import build
import json
import unicodedata
import time
import os
# Remove non-ASCII characters from the OCR output
def strip_accents(text):
    text = unicodedata.normalize('NFD', text)\
        .encode('ascii', 'ignore')\
        .decode("utf-8")
    return str(text)
# Run the query through the Google Custom Search API
def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res
#using Tesseract OCR
question = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/question.png'), lang = 'eng'))
answer1 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer1.png'), lang = 'eng'))
answer2 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer2.png'), lang = 'eng'))
answer3 = strip_accents(pytesseract.image_to_string(Image.open('/Users/lorenzo/Desktop/live_quiz/answer3.png'), lang = 'eng'))
#creating three new questions by taking the original question and each of the answers
edited_question_1 = question + '? ' + '"' + answer1 + '"'
edited_question_2 = question + '? ' + '"' + answer2 + '"'
edited_question_3 = question + '? ' + '"' + answer3 + '"'
#searching each new question separately (my_api_key and my_cse_id are defined elsewhere)
result1 = google_search(edited_question_1, my_api_key, my_cse_id, num = 1)
result2 = google_search(edited_question_2, my_api_key, my_cse_id, num = 1)
result3 = google_search(edited_question_3, my_api_key, my_cse_id, num = 1)
#counting the search results for each google search
num_results_1=int(result1['searchInformation']['totalResults'])
num_results_2=int(result2['searchInformation']['totalResults'])
num_results_3=int(result3['searchInformation']['totalResults'])
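At the moment, this approach of running a separate search for each of the three new questions is very inaccurate. Each query is just the original question plus one of the answers, and the number of results can be influenced by many factors that have nothing to do with the actual question (for example, how popular one of the answers is).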
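To make the popularity problem concrete, here is a minimal sketch of one possible way to discount it: divide the hit count of each question-plus-answer query by the hit count of the answer searched on its own. The function name `normalized_scores` and all of the counts below are made up for illustration; this is an assumption about what might help, not a tested fix.

```python
def normalized_scores(combined_hits, answer_hits):
    """Divide each question+answer hit count by the hit count of the answer alone.

    Both arguments are dicts mapping an answer string to a result count.
    This is a hypothetical helper, not part of any library.
    """
    scores = {}
    for answer, combined in combined_hits.items():
        baseline = answer_hits.get(answer, 0)
        # Guard against division by zero when an answer has no hits by itself.
        scores[answer] = combined / baseline if baseline > 0 else 0.0
    return scores


# Made-up counts, for illustration only.
combined = {"Paris": 900, "Lyon": 300, "Nice": 50}
alone = {"Paris": 90000, "Lyon": 3000, "Nice": 10000}

scores = normalized_scores(combined, alone)
best = max(scores, key=scores.get)
print(best, scores)
```

With these invented numbers the raw counts would pick "Paris" simply because it is the most popular term, while the normalized score favors "Lyon". It costs one extra search per answer, and the reported result totals are only estimates, so the ratio is still a rough signal.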
Does anyone know a better way to approach this problem and improve the accuracy?