Difference between electronic text and printed text

Time: 2018-09-20 23:07:30

Tags: ocr tesseract

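I have been experimenting with the Tesseract API. I noticed a difference in how well the API recognizes text on a computer screen versus text on a printed page. For example, this is the output detected from a question in the electronic version of a test prep book I have: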

paper. five students-liang, xramer. topaz. wegregian, and otweill-eachaaf\n\nglraviewpnne or more of exactly three plays: sunset, lamerlane, and undulation, but do not\ntraviaw7any other playsu lhe following conditions must apply:\n\nttkramar and tape: each review fewer of the plays than megregian.\n\ng,neither tape: nor negragian reviews any play liang reviews.\n\nglxramar and otweill both review lamerlane. 6xactly two of the students review exactly the\n\ntisame\n\n.lplay or plays as each nther.\n\n

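That is not exactly right, but it is at least recognizable English text. The following is the output for a picture of the same text in the printed book: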

l .r l r t. nw r r l , 1w,   y , y\n r v at? n -- r l a lvwfifwbv aw.    \n4 rrf,,.  fa4n    , t ,4? v . , l, a! .i ,w v , 4,31, .7? \nv 4r w 4. mtw air  . 1 -\na ,rwitf . 73w, .6 41, a? rag? . z   a f . 15,. -u mm 9,143\n. arr, en l l  t yriaxfi ltx  9.. , .51. a, l m: -rly z l a\n4i?  ti:  ft mt t k wvw y t 491.1,: x,  ,, 3\n,w .v .1 i fat if  31-1, y? l- -:- t 0m f, 1,, aw 4,1. ,,,+ +  .ft , u!\n.u n, . a a. v, r,.  .. , , d. . , , . ,1 wk.\n? fa ? - l. fl - l :l 1, a , l ,l ,2: i  l\n9 a .31 1 a v .1 gr  i w , v , r, v 1\n.   i-t . lg! lfy w v-l a g ,5 it 1, 135a l f t v  t l 4 c\n , 4, linketahydl , 3v ., - tl l . v, . f t t   v w i l\naid-n. : ya,er 9g!  . ,f . , ., i 4 l la  . , v\n1.1 g  . ll . 4 i  , w l .   , . , , in\nafff .1 in r w  l l ,4 i\nt :17 . w i a. l . v .,\n. 1:51.  :  tub?\n w ., tint  - .\nt l .3. l huh raw . . t 2x139. v tt\n   -  \nagain :35 apt-3333!? y 1 :3 mu ,  \n. :rtzp-nf-ia..?3411.552.: gt .\n  x 3:6,3- sailrxxfyzczij ,\n. .v  t\n. :    4 l \n.jr.,..::l,sr-3::,:5 1t u....-iv.w::.:i-kfj- l -\nuesnons 19-3 . -   ,. .\nn .3392:  t\n ltzli aglitvifs t l\n5. , .  .. .is-gvfz4fw1g 5 v\n. .    gag t.\n 5w  tun-mural . i-fi2,,35y\n. a.  may\n.. .7. .5. 443- 13:337.\n13m the school a r fiv a r .    \nt- a , xl h u\n, l . - yxlnfyyztwlrvlklnt-t\nl. 11-,-ira,,i,ot 3x161 :. j a . .\nv  - :5 5. f ,  \nhie an an e a.  aqwgam  . u\n, .  9 i: l 4, 5,3. i \n  3,13 5 :ii- t t\n5:??? , :jf :3 ,\n. l . a i j , 5:? tint - t v\n. . l, g bf .4 , 1 4,  \n., ,4 n  v .t.   v\na mm  .:   ii a.  \no n? l t. \nv w y r at  a\nre 16 an . . . w  ,\nt  m3ftga 2:12? 3 g t ., ,\ner and l ! dalttlawf. - l  ,\na  v.  -r .7 .34\n. l i.-.::: an: -s.y l\ne an,      w\no g k a jtialiict-itiv... a, 1 f ,\n27v t2 w. , f \nf n .-  k t\nlqelther  5 g.-   r -\nl . f .iftifxln 7\n0 o .   1 r \nmlcws. t  \n, naryviiit? l n .\n. -. . wk   \n, gut-xi  it s .\nx  trilwgitata, 9 a\n5 31f- t :3 v t -. .\n.  i,.!:5 t .l a\n v  f law-jam i, if\n. a. :wzgngmfn: ,v.\nv . 5 . r it w : . 
t a\nv0 gag furl 1. l\nt  23.5,:m-f  r . ,\n4 .- l 3.1:, -,n:.n. v v  ,\n-. a  f   gig?   . f. -\n : 3.x:   24-355   l .t .\n93:13?!   .\n. i v   :grgcif tars-17 w b ,v\n,, fwtwgiwrvg: : i v k .r l, \n: gatikzn . .w 1:3: :3:- t. . at . .\nv 53- vir  1571:. u\n\n

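I know that Tesseract itself does a lot of preprocessing. I am just wondering what causes such a huge difference. For reference, in case it makes a difference, the images were taken on an iPhone. I am also using this character whitelist: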

        tesseract.charWhitelist = "abcdefghijklomnopqrstuvwxyz.,!?-:+1234567890"
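
As a quick sanity check on the whitelist itself, here is a minimal sketch in plain Swift with no Tesseract dependency. Only the whitelist string is taken from the line above. The variable names and the sample of excluded characters are illustrative assumptions, not part of any Tesseract API.

```swift
// Whitelist string copied verbatim from the line above.
let charWhitelist = "abcdefghijklomnopqrstuvwxyz.,!?-:+1234567890"

// Distinct characters the recognizer is allowed to emit.
let allowed = Set(charWhitelist)

// Lowercase letters that are absent from the whitelist, if any.
let missingLowercase = Set("abcdefghijklmnopqrstuvwxyz")
    .subtracting(allowed)
    .sorted()

// Characters that occur more than once in the whitelist string.
var seen = Set<Character>()
var repeated: [Character] = []
for ch in charWhitelist {
    if !seen.insert(ch).inserted {
        repeated.append(ch)
    }
}

// A hypothetical sample of characters common in English prose,
// filtered down to those the whitelist does not contain.
let excludedSamples = Array("E'\"();").filter { !allowed.contains($0) }

print("missing lowercase:", missingLowercase)
print("repeated:", repeated)
print("excluded samples:", excludedSamples)
```

The whitelist covers every lowercase letter (with `o` listed twice) but contains no uppercase letters, apostrophes, or quotation marks. That may account for some substitutions in the first output, such as `6xactly` where `Exactly` would be expected, but it does not explain the much larger gap between the two outputs.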

Thanks for your help!

0 answers:

No answers