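Using the shell command tesseract in a Perl script to print the text output

Time: 2015-03-28 19:52:24

Tags: regex perl shell

Hi, I have a script I am trying to write. First I grab an image from the HTML, then I want to run tesseract on it to get the text output. I really don't know how to do this.

Here is the code: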

#!/usr/bin/perl -X


##########
$user = ''; # Enter your username here
$pass = ''; # Enter your password here
###########

# Server settings (no need to modify)
$home = "http://37.48.90.31";
$url  = "$home/c/test.cgi?u=$user&p=$pass";

# Get HTML code
$html = `GET "$url"`;

#### Add code here:
# Grab img from HTML code

if ($html =~ /\img[^>]* src=\"([^\"]*)\"[^>]*/) {
    $takeImg = $1;
    }
@dirs = split m!/!, $takeImg;
$img = $dirs[2];
#########
die "<img> not found\n" if (!$img);


# Download img to server (save as: ocr_me.img)
print "GET '$img' > ocr_me.img\n";
system "GET '$img' > ocr_me.img";


#### Add code here:
# Run OCR (using shell command tesseract) on img and save text as       ocr_result.txt
system ("tesseract", "tesseract ocr_me.img ocr_result");


###########
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");

# Check OCR results:
$txt = `cat ocr_result.txt`;

Did I grab the image from the HTML correctly, or do I need a different regex? And how do I display 'ocr_result.txt'?

Thanks to everyone who helps!

1 Answer:

Answer 0: (score: 0)

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;   # exports get(), getstore() and is_success()
use URI;           # used to turn a relative src into an absolute URL

##########
my $user = ''; # Enter your username here
my $pass = ''; # Enter your password here
###########

# Server settings (no need to modify)
my $home = "http://37.48.90.31";
my $url  = "$home/c/test.cgi?u=$user&p=$pass";

# Get HTML code
my $html = get($url);
defined $html or die "Could not fetch $url\n";

# Grab the src attribute of the first <img> tag in the HTML code
if ($html =~ /<img[^>]* src="([^"]*)"/)
{
        # Resolve the src against the page URL, so relative paths also work
        my $img = URI->new_abs($1, $url)->as_string;

        # Download img to server (save as: ocr_me.img)
        is_success(getstore($img, 'ocr_me.img'))
                or die "Could not download $img\n";

        # Run OCR (using shell command tesseract) on img and save text as ocr_result.txt.
        # Each argument is its own list element; tesseract appends ".txt" itself.
        system("tesseract", "ocr_me.img", "ocr_result") == 0
                or die "tesseract failed (status $?)\n";

        ###########
        die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");

        # Check OCR results: print the recognized text
        open(my $fh, '<', 'ocr_result.txt')
                or die "Cannot open ocr_result.txt: $!\n";
        print while <$fh>;
        close($fh);
}
else
{
        print "Image not found\n";
}
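
The two parts the question asks about (the regex and the tesseract call) can also be pulled into small subroutines so each can be checked on its own. The following is only a minimal sketch: the helper names `extract_img_src` and `run_tesseract` and the sample HTML string are hypothetical, and it assumes that the `src` value is quoted and that `tesseract` is on the PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the src attribute of the first <img> tag,
# or undef if there is none. Assumes the attribute value is quoted.
sub extract_img_src {
    my ($html) = @_;
    return $html =~ /<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1/is ? $2 : undef;
}

# Hypothetical helper: run tesseract on $image and return the recognized text.
# tesseract writes its output to "$out_base.txt".
sub run_tesseract {
    my ($image, $out_base) = @_;

    # List form of system(): every argument is passed separately, no shell quoting issues.
    system('tesseract', $image, $out_base) == 0
        or die "tesseract failed (status $?)\n";

    my $txt_file = "$out_base.txt";
    open(my $fh, '<', $txt_file) or die "Cannot open $txt_file: $!\n";
    local $/;                 # slurp mode: read the whole file at once
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Small demonstration, run only when the file is executed directly.
unless (caller) {
    # Made-up HTML snippet, for illustration only
    my $src = extract_img_src('<p><img alt="captcha" src="/c/sample.png"></p>');
    print "src: $src\n";

    # Needs tesseract and a downloaded image, so it is left commented out:
    # print run_tesseract('ocr_me.img', 'ocr_result');
}
```

Run directly, the demonstration prints `src: /c/sample.png`. A regex like this is adequate for a single known page; for arbitrary HTML a real parser would be the safer choice.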