Tesseract OCR works very poorly on iOS (7)

Time: 2013-10-20 21:56:57

Tags: ios7 ocr tesseract text2image

I don't know whether the problem is me or the Tesseract library, but the recognition results are very bad.

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];

// Restrict recognition to these characters
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZéèô" forKey:@"tessedit_char_whitelist"];
// Image to run OCR on
[tesseract setImage:[UIImage imageNamed:@"sampledoc.jpg"]];
[tesseract recognize];

NSLog(@"%@", [tesseract recognizedText]);

[tesseract clear];

This is the sample image I want to extract text from:

[image: sample document page]

And this is what I get after running it:

THE SILVER CHAIR
by r 5 Lawn
CHAPTER ow
BEHIND THE cm

lr W1C a dull aulumn day and llll Pole vmscrylng ulmo mo gym
She ms clymg because Illey had been bullymg her Hus Is not gmng In baa school oolyl se I
shall say 15 lane is poslble Ibvlll lllrs schwll which lsnol 1 plusinl subjzrl II was Tcr
eduummlr o sdsooV rm bolh boysuld glrlsl Mm used no he cnllcd o wmxodl schonll some
said on wax ml nculy so mixed as an mlndsohhe people whn an n These penple had um mu
m boyund glrlsshauld loeullma mdn who my mo And unlonunalcb mm ml or
mom aflhc hlggzsl bays mo girls liked best was bullying Ihe mm All suns orlllmgsl hound
mmgso went on Much u an nvdmlry saloon wnuld mm bum flwnd om ma snowed m lulfn
R1my hm al Ilus school xhcy vlucrfl Or mu Iflhcy mo mo people who am am wxc not
expellad m pomsloa The mm no they Mile lntntesilng psycholoycnl msxs mdsaul for
them and mm mlhem for hnun Mo Ifyml knew lhe nghl sorlofdnngxmsay In mo um
mo maul result wos um vou became mlhev 1 fmounlelhan olllnrwlsc
no mswmy ml Pole W crymg on ml dull autumn my on me dlmp Vmlc pith Much runs
bellman um um arm gym ma Ihe lhvubbezy mm ole mam nearly nmulea her ay whan
boy came round Ihz oomuonhogym Mxmlmg mm ms lnmlds m ms pocktu I12 mm In
lmo nu
 CuIV yuu look when yolfre gomw ma JIH Fob
Mu nglur sud me km won mam man a and am he mom hen rm ll WV Polef he
not was upv
ml only mndc lung the am you mm mo yodic llymg oo my somclhmg um um Ihn lfyou
spnk you1l smrl ctymg owl
 lfs mum I suww l as mualr sand me hwy Mlmlbx ouggmg ms hlnds nmm mm ms vovkals
ml waded Them wlsw moo forhurm sly llH1hVlIgoCVOllWiIE ooolo have Said u They both
knew
wow laok has said the beyl Wherek no gond us all r
He mezm WEIL am he am mlk mum mo mlnmne begmnmg n lecmne ml suddenly liew mm a
lmxpcr hvmdl Isqnllc Illkcly llllng Io hlppen Ifyou law been mmrupled in n cryl

I

What should I do?

2 answers:

Answer 0 (score: 1)

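He means the pixel density (PPI), not the image dimensions.

I rescaled the image (from 96 DPI) to 300 DPI, and almost all of the text was recognized correctly. The image definitely needs preprocessing before the OCR step.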
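The rescaling described above could be sketched roughly as follows with UIKit. This is only an illustration under stated assumptions: the helper name `UpscaledImageForOCR` and its parameters are hypothetical, the 96 and 300 DPI figures are the ones from this answer, and the source DPI has to be supplied by the caller because `UIImage` does not expose a DPI property. It also assumes an opaque image such as a JPEG scan.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper: redraws `image` enlarged by targetDPI / sourceDPI
// so that the text covers more pixels before OCR.
// Assumes an opaque source image (e.g. a JPEG scan).
static UIImage *UpscaledImageForOCR(UIImage *image, CGFloat sourceDPI, CGFloat targetDPI) {
    CGFloat factor = targetDPI / sourceDPI;   // 300 / 96 = 3.125
    CGSize newSize = CGSizeMake(round(image.size.width * factor),
                                round(image.size.height * factor));

    // A context scale of 1.0 makes newSize a size in pixels, not points.
    UIGraphicsBeginImageContextWithOptions(newSize, YES, 1.0);
    CGContextSetInterpolationQuality(UIGraphicsGetCurrentContext(),
                                     kCGInterpolationHigh);
    [image drawInRect:CGRectMake(0, 0, newSize.width, newSize.height)];
    UIImage *scaled = UIGraphicsGetImageFromCurrentImageContext();
    UIGraphicsEndImageContext();
    return scaled;
}
```

With the question's code, the call would then look like `[tesseract setImage:UpscaledImageForOCR([UIImage imageNamed:@"sampledoc.jpg"], 96.0, 300.0)];`. Note that a factor of 3.125 multiplies the pixel count by roughly ten, so memory use on the device grows accordingly.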

Answer 1 (score: 0)

Tesseract *tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setImage:chosenImage];
[tesseract recognize];

NSLog(@"%@",[tesseract recognizedText]);