Question

I need to extract a CAPTCHA image from a URL and recognize it with Tesseract. My code is:

#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

The image is parsed correctly. It contains the CAPTCHA.

My output is:

GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified

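As you can see, the script parses the image correctly, but Tesseract does not see anything in that PNG file. I have tried passing additional parameters to the tesseract shell command, such as -psm and -l, but that did not help either.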

Update: After reading @Dave Cross's answer, I tried his suggestions.

In the output, I got:

http://perltest.adavice.com/captcha/1533141024.png
ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[]
200Captcha text not specified
Original image file not specified
Captcha text not specified

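Why do I need the text from the .PNG image? Maybe this additional information will help you.

The page at $url shows a form when opened in a browser. My goal is to create a query for this page in wim using Perl. To do that, I need to fill in that form with $user, $pass and $txt (the text Tesseract recognized from the image) and send it with POST "url" (the last line of the code).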

Answer 1

There are a few odd things going on here. Any one of them could be causing your problem.

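Using -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix every problem that reveals (I would also suggest adding use strict, but then you need to declare all of your variables).
I suggest using LWP::Simple rather than shelling out to GET.
Please don't use regular expressions to parse HTML. Use a real HTML parser instead. Web::Query is my current favourite.
You then run GET again, using a variable called $txt that does not have a value. That is not going to work!
$txt = 'cat ocr_result.txt' does not do what you think it does. You need backticks, not single quotes.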
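
The last point is easy to check in isolation. As a minimal sketch (assuming a Unix-like shell where echo is available; echo is only a stand-in here for cat ocr_result.txt), single quotes give you a plain string, while backticks run the command and capture its output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single quotes: just a literal string; nothing is executed.
my $literal = 'echo hello';

# Backticks: the command is run by the shell and its stdout is captured.
my $captured = `echo hello`;
chomp $captured;    # drop the trailing newline that echo prints

print "single quotes: [$literal]\n";
print "backticks:     [$captured]\n";
```

With single quotes, the later substitution in the question's script operates on the text of the command itself, not on the OCR result.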

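Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for fetching the image from your example and extracting the text from it.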

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

use LWP::Simple;

my $img_url  = 'http://perltest.adavice.com/captcha/1533110309.png';
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

my $txt = `tesseract $img_file stdout`;

say $txt;

This is your actual error:

system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";

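You ask tesseract to write its output to ocr_result.txt, but two lines later you overwrite that file with the output of a failed call to GET. I'm not sure what you were trying to do there, but it throws away whatever output tesseract had already stored in that file.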
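
One way to avoid clobbering the file is to let tesseract write ocr_result.txt and then read that file directly in Perl instead of shelling out again. The sketch below is an illustration under stated assumptions, not the asker's code: read_ocr_result is a hypothetical helper, and a temporary file with made-up content stands in for the file tesseract would have written.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp the OCR result file and keep only the
# characters the original script allowed (letters, digits, - _ .).
sub read_ocr_result {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my $txt = do { local $/; <$fh> };   # read the whole file at once
    close $fh;
    $txt = '' unless defined $txt;      # guard against an empty read
    $txt =~ s/[^A-Za-z0-9\-_.]+//g;     # also strips spaces and newlines
    return $txt;
}

# Stand-in for the file that `tesseract ocr_me.img ocr_result` would create.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "aB3 xY\n\n";
close $tmp_fh;

my $txt = read_ocr_result($tmp_path);
print "[$txt]\n";
```

In the real script, the path passed to the helper would be ocr_result.txt, and there would be no second GET between the tesseract call and the read.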

Update to the update:

Here is my current version of the code:

#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw[$ua get getstore];
use File::Basename;
###
my $user = 'xxxx'; #Enter your username here
my $pass = 'xxxx'; #Enter your password here
###
#Server settings
my $home = "http://perltest.adavice.com";
my $url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
my $html = get($url);
my $img;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
my $img_url = $home . $img;
my $img_file = 'ocr_me.img';

getstore($img_url, $img_file);

say $img_url;
say $img_file;

# Looks like tesseract adds two newlines to its output -
# so chomp() it twice!
chomp(my $txt = `tesseract ocr_me.img stdout`);
chomp($txt);

say "[$txt]";

$txt =~ s/\W+//g;

my $resp = $ua->post($url, {
  u    => $user,
  p    => $pass,
  file => basename($img),
  text => $txt,
});

print $resp->code;
print $resp->content;

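I have done a few things:

Corrected $img_url from $url . $img to $home . $img (this is what was preventing it from fetching the correct image).
Switched to using LWP::Simple throughout (it's just easier).
chomped the output from tesseract (twice!) to remove the newlines.
Used File::Basename to get the correct file name to pass to the final POST.
Removed all non-word characters from $txt before the POST.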
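
These clean-up steps can be tried without tesseract or the server. In the sketch below, the raw string is a made-up sample (an assumption, not real tesseract output) and the image path is taken from the question's output. It shows that one chomp removes only one trailing newline, that stripping \W+ removes the remaining newlines anyway, and what basename returns for the image path.

```perl
use strict;
use warnings;
use File::Basename;

# Made-up sample standing in for tesseract output, plus the <img> src path.
my $raw = "x7 Kq\n\n";
my $img = '/captcha/1533110309.png';

my $once = $raw;
chomp $once;                 # removes only ONE trailing newline

my $clean = $raw;
$clean =~ s/\W+//g;          # removes newlines, spaces and other non-word characters

my $file = basename($img);   # strips the directory part of the path

print "newline left after one chomp: ", ($once =~ /\n\z/ ? "yes" : "no"), "\n";
print "clean text: [$clean]\n";
print "file name:  [$file]\n";
```

Note that \W+ also removes the - and . characters that the question's original character class kept, so which pattern is appropriate depends on what the server expects.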

It still doesn't work. It seems to hang, waiting for a response from the server. But I'm afraid I have run out of time to help you.
