如何使用perl将非latin1数据发布到非UTF8站点?

时间:2010-06-03 00:40:01

标签: perl encoding post webforms

我想使用LWP :: UserAgent在CP1251网站上发布俄文,并获得以下结果:

# $text="Русский текст"; obtained from command line
FIELD_NAME => $text                                # result: Г?в г'В?г'В?г'В?г?вєг?вёг?в? Г'В'Г?вчг?вєг'В?г'В'
$text=Encode::decode_utf8($text);
FIELD_NAME => $text                                # result: Р с?с?с?рєрёр? С'Рчрєс?с'
FIELD_NAME => Encode::encode("cp1251", $text)     # result: Г?гіг+г+гЄгёгЏ ГІгҐгЄг+гІ
FIELD_NAME => URI::Escape::uri_escape_utf8($text) # result: D0%a0%d1%83%d1%81%d1%81%d0%ba%d0%b8%d0%b9%20%d1%82%d0%b5%d0%ba%d1%81%d1%82

我该怎么做?内容类型必须为x-www-form-urlencoded。你可以找到类似的here形式,但是你可以使用&#...来逃避任何非拉丁字符。表单,尝试在FIELD_NAME中转义它会导致10561091108910891 10901077108210891(每&#;从字符串中删除)或1056;усский текст (字符串开头的标点符号被删除)取决于FIELD_NAME实际上是什么。

UPDATE :有人知道如何转换以下代码,以便它将使用LWP :: UserAgent :: post函数吗?

my $url=shift;
my $fields=shift;
my $request=HTTP::Request->new(POST => absURL($url));
$request->content_type('application/x-www-form-urlencoded');
$request->content_encoding("UTF-8");
$ua->prepare_request($request);
my $content="";
for my $k (keys %$fields) {
    $content.="&" if($content ne "");
    my $c=$fields->{$k};
    eval {$c=Encode::decode_utf8($c)};
    $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
    $content.="$k=".URI::Escape::uri_escape($c);
}
$request->content($content);
my $response=$ua->simple_request($request);

此代码实际上解决了这个问题,但我不想添加第三个请求包装函数(与getpost一起)。

3 个答案:

答案 0 :(得分:0)

如果你有可用的recode系统命令,那么使用recode系统命令的方法似乎是(远非最好的)。来自http://const.deribin.com/files/SignChanger.pl.txt

my $boardEncoding="cp1251"; # encoding used by the board
$vals{'Post'} = `fortune $forunePath | recode utf8..$boardEncoding`;
$res = $ua->post($formURL,\%vals);

另一种方法似乎是http://mail2lj.nichego.net/lj.txt

my        $formdata = $1 ;
my        $hr = ljcomment_string2form($formdata) ;
my        $req = new HTTP::Request('POST' => $ljcomment_action)
        or die "new HTTP::Request(): $!\n" ;

$hr->{usertype} = 'user' ;
$hr->{encoding} = $mh->mime_attr('content-type.charset') ||
                  "cp1251" ;
$hr->{subject}  = decode_mimewords($mh->get('Subject'));
$hr->{body} = $me->bodyhandle->as_string() ;

$req->content_type('application/x-www-form-urlencoded');
$req->content(href2string($hr)) ;

my      $ljres = submit_request($req, "comment") ;

if ($ljres->{'success'} eq "OK") {
    print STDERR "journal updated successfully\n" ;
} else {
    print STDERR "error updating journal: $ljres->{errmsg}\n" ;
    send_bounce($ljres->{errmsg}, $me, $mh->mime_attr("content-type.charset")) ;
}

答案 1 :(得分:0)

使用WWW::Mechanize,它会自动处理编码(包括字符编码和表单编码),如果form元素的accept-charset属性设置得当,它会做正确的事情。如果缺少,则表单默认为UTF-8,因此需要更正。你似乎处于这种情况。顺便说一下,您的示例站点的编码是KOI8-R,而不是Windows-1251。工作示例:

use utf8;
use WWW::Mechanize qw();
my $message = 'Русский текст';
my $mech = WWW::Mechanize->new(
    cookie_jar => {},
    agent => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.9 SUSE/6.0.401.0-2.1 (KHTML, like Gecko)',
);
$mech->get('http://zhurnal.lib.ru/cgi-bin/comment?COMMENT=/z/zyx/index_4-1');
$mech->current_form->accept_charset(scalar $mech->response->content_type_charset);
$mech->submit_form(with_fields => { TEXT => $message });

HTTP转储(仅限基本部分):

POST /cgi-bin/comment HTTP/1.1
Content-Length: 115
Content-Type: application/x-www-form-urlencoded

FILE=%2Fz%2Fzyx%2Findex_4-1&MSGID=&OPERATION=store_new&NAME=&EMAIL=&URL=&TEXT=%F2%D5%D3%D3%CB%C9%CA+%D4%C5%CB%D3%D

答案 2 :(得分:-1)

这些功能解决了这个问题(首先发布application / x-www-form-urlencoded数据,第二个发布multipart / form-data):

#{{{2 postue
sub postue($$;$) {
    my $url=shift;
    my $fields=shift;
    my $referer=shift;
    if(defined $referer and $referer eq "" and defined $fields->{"DIR"}) {
        $referer=absURL($url."?DIR=".$fields->{"DIR"}); }
    else {
        $referer=absURL($referer); }
    my $request=HTTP::Request->new(POST => absURL($url));
    $request->content_type('application/x-www-form-urlencoded');
    $request->content_encoding("UTF-8");
    $ua->prepare_request($request);
    my $content="";
    for my $k (keys %$fields) {
        $content.="&" if($content ne "");
        my $c=$fields->{$k};
        if(not ref $c) {
            $c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
            $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
            $c=URI::Escape::uri_escape($c);
        }
        elsif(ref $c eq "URI::URL") {
            $c=$c->canonical();
            $c=URI::Escape::uri_escape($c);
        }
        $content.="$k=$c";
    }
    $request->content($content);
    $request->referer($referer) if(defined $referer);
    my $i=0;
    print STDERR "Doing POST request to url $url".
        (($::o_verbose>2)?(" with fields:\n".
                ::YAML::dump($fields)):("\n"))
        if($::o_verbose>1);
  REQUEST:
    my $response=$ua->simple_request($request);
    $i++;
    my $code=$response->code;
    if($i<=$o_maxtries and 500<=$code and $code<600) {
        print STDERR "Failed to request $url with code $code... retrying\n"
            if($::o_verbose>2);
        sleep $o_retryafter;
        goto REQUEST;
    }
    return $response;
}
#{{{2 postfd
sub postfd($$;$) {
    my $url=absURL(shift);
    my $content=shift;
    my $referer=shift;
    $referer=absURL($referer) if(defined $referer);
    my $i=0;
    print STDERR "Doing POST request (form-data) to url $url".
        (($::o_verbose>2)?(" with fields:\n".
                ::YAML::dump($content)):("\n"))
        if($::o_verbose>1);
    my $newcontent=[];
    while(my ($f, $c)=splice @$content, 0, 2) {
        if(not ref $c) {
            $c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
            $c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
        }
        push @$newcontent, $f, $c;
    }
  POST:
    my $response=$ua->post($url, $newcontent,
                           Content_type => "form-data",
                           ((defined $referer)?(referer => $referer):()));
    $i++;
    my $code=$response->code;
    if($i<=$o_maxtries and 500<=$code and $code<600) {
        print STDERR "Failed to download $url with code $code... retrying\n"
            if($::o_verbose>2);
        sleep $o_retryafter;
        goto POST;
    }
    return $response;
}