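I have Python code that either runs OCR on a PDF or converts it to text, and then uses a regex-based approach to find the document category. I want to expose this code as an API, and I am using Flask for that. When I open the URL, I get a 404 Not Found error.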
My document category extraction code is in a file named dtd.py, and my Flask API code is in a file named app.py. The dtd.py code is as follows:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import re
import io
from PIL import Image
import pytesseract
from wand.image import Image as wi

def convert(fname, pages=None,encoding='utf-8'):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    if len(text)>=500:
        regex3=re.search(r"\d+(?:[.-]\w+)*\s*(General Information|Process validation|Manufacturer(s)Reference Standards or Materials|Container Closure Systems|Stability Summary and Conclusions|Post Approval Stability Protocol and Stability Commitment)",text,re.IGNORECASE)
        return regex3
    else:
        pdffile = wi(filename = fname, resolution = 300)
        pdfImg = pdffile.convert('jpeg')
        imgBlobs = []
        for img in pdfImg.sequence:
            page = wi(image = img)
            imgBlobs.append(page.make_blob('jpeg'))
        # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
        # extracted_text = []
        for imgBlob in imgBlobs:
            im= Image.open(io.BytesIO(imgBlob))
            text2 = pytesseract.image_to_string(im, lang = 'eng')
            regex3=re.search(r"\d+(?:[.-]\w+)*\s*(General Information|Manufacturer(s)|Process Validation|Batch Formula|Description of Manufacturing Process and Process Controls|Container Closure Systems|Stability Summary and Conclusions|Post Approval Stability Protocol and Stability Commitment)",text2,re.IGNORECASE)
        return regex3

convert(r'D:\files\00ac4250-d746-4c8a-b3-2798b0c2d4f9.pdf')
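A side note on the patterns in convert: in a regular expression, an unescaped (s) is a capturing group, so Manufacturer(s) matches the text "Manufacturers" rather than the literal heading "Manufacturer(s)". The sketch below illustrates the difference with a shortened pattern and a made-up heading; both the sample text and the shortened alternatives are assumptions for illustration, not taken from the real PDFs.

```python
import re

# Unescaped: "(s)" is a capturing group, so this matches "Manufacturers".
unescaped = re.compile(r"\d+(?:[.-]\w+)*\s*(Manufacturer(s))", re.IGNORECASE)

# Escaped: "\(s\)" matches the literal characters "(s)".
escaped = re.compile(r"\d+(?:[.-]\w+)*\s*(Manufacturer\(s\))", re.IGNORECASE)

# Hypothetical heading, used only to show the behaviour.
heading = "3.2.S.2.1 Manufacturer(s)"

print(unescaped.search(heading))         # no match for the literal "(s)"
print(escaped.search(heading).group(1))  # the literal heading text
```

If the headings in the documents really contain the literal "(s)", escaping the parentheses in both patterns is probably what was intended.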
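dtd.py returns the name of the category, for example "Manufacturer", and I want to expose that through a REST API. How can I do this effectively?
This is app.py, which produces a 500 Internal Server Error in the stack trace: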
import dtd
from dtd import convert
from flask import Flask, request
from flask_restful import Resource, Api
#from flask.views import MethodView

app = Flask(__name__)
api = Api(app)
#convert(r'D:\files\67cecf40-71cf-4fc4-82e1-696ca41a9fba.pdf')

class dtdtext(Resource):
    def get(self, result):
        return {'data': dtd.convert(result)}

#api.add_resource(dtdtext, '/dtd/<result>')
categories=convert(r'D:\files\67cecf40-71cf-4fc4-82e1-696ca41a9fba.pdf')

@app.route('/dtd')
def returnResult():
    return {'data': categories}

if __name__ == '__main__':
    app.run()
Answer 0 (score: 1)
Instead of your current code, you should declare the resource route as follows:
api.add_resource(dtdtext, '/dtd/<result>')
I did not quite understand what you want to return; this line returns the category that the convert function returns.
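One likely cause of the 500 error is that convert returns the result of re.search, which is a Match object (or None) rather than a string, and a Match object cannot be serialized to JSON. The sketch below reproduces that with the standard library only; the helper name category_of, the shortened pattern, and the sample text are assumptions for illustration.

```python
import json
import re

# Shortened version of the question's pattern, for illustration only.
PATTERN = re.compile(
    r"\d+(?:[.-]\w+)*\s*(General Information|Process Validation)",
    re.IGNORECASE,
)

def category_of(text):
    """Return the matched category name as a string, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical page text.
sample = "3.2.P.3.5 Process Validation"

# Serializing the Match object itself fails with a TypeError.
try:
    json.dumps({"data": PATTERN.search(sample)})
except TypeError as exc:
    print("not serializable:", exc)

# Serializing the captured text works.
print(json.dumps({"data": category_of(sample)}))
```

Returning match.group(1) from convert, instead of the Match object, would give the API a plain string to send back.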
If you want the REST API to match against the available categories returned by the convert function, write the route as follows:
categories=convert(r'D:\files\67cecf40-71cf-4fc4-82e1-696ca41a9fba.pdf')

@app.route('/dtd')
def returnResult():
    return str({'data': categories})
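Wrapping the value in str() avoids the serialization error, but the response is then the text representation of a Python dict rather than JSON. As a rough alternative, the sketch below returns JSON with jsonify and takes the file name from the URL. The convert function here is a stand-in that only pretends to classify a file, and the route shape and error payload are assumptions, not the asker's actual API.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def convert(fname):
    """Stand-in for dtd.convert (hypothetical): returns a category string or None."""
    return "Manufacturer(s)" if fname.endswith(".pdf") else None

# "<path:fname>" lets the variable part contain slashes.
@app.route("/dtd/<path:fname>")
def dtd_category(fname):
    category = convert(fname)
    if category is None:
        # Nothing matched: report it instead of returning null data.
        return jsonify({"error": "no category found"}), 404
    return jsonify({"data": category})
```

With the usual app.run() guard from app.py added, a request to /dtd/report.pdf would return {"data": "Manufacturer(s)"} under these assumptions. Accepting arbitrary file paths from a URL is risky outside a local experiment, so a real service would probably restrict the lookup to a known directory.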