Problem logging in to a website with LWP

Asked: 2014-10-12 18:11:10

Tags: perl www-mechanize lwp

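I'm new to LWP and would appreciate any help. I'm writing a small Perl script to log in to a website and download a file. The process works perfectly in a browser, but not through LWP. In a browser, the process is: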

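  1. Log in to the website by authenticating (username, password)
  2. After a successful login, the website loads another page
  3. The "download" page can then be accessed and the file downloaded
  4. If someone who is not logged in tries to access the download page, the website loads a "registration" page for creating a login instead

This process works in a browser. The URL and the username/password are real, so you can try it on the website using the details in the code.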

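With the script, however, I get a success code, but the website does not give access to step 2 or 3. Instead of the file, the registration page is downloaded. I suspect this means the login is not working from the script.

Any help getting this to work is much appreciated.

The code is below: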

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use HTTP::Request;
    use WWW::Mechanize;
    
    my $base_url = "http://www.eoddata.com/default.aspx";
    my $username = 'xcytt';
    my $password = '321pass';
    
    # create a cookie jar on disk
    my $cookies = HTTP::Cookies->new(
        file     => 'cookies1.txt',
        autosave => 1,
    );
    
    my $http = LWP::UserAgent->new();
    $http->cookie_jar($cookies);
    
    my $login = $http->post(
        'http://www.eoddata.com/default.aspx',
        Content => [
            username => $username,
            password => $password,
        ]
    );
    
    # check if log in succeeded
    
    if ( $login->is_success ) {
        print "The response from server is " . $login->status_line . "\n\n";
        print "The headers in the response are \n" . $login->headers()->as_string() . "\n\n";
        print "Logged in Successfully\n\n";
        print "Printing cookies after successful login\n\n";
        print $http->cookie_jar->as_string() . "\n";
        my $url = "http://www.eoddata.com/Data/symbollist.aspx?e=NYSE";
        print "Now trying to download " . $url . "\n\n";
    
        # make request to download the file
        my $file_req = HTTP::Request->new( 'GET', $url );
        print "Printing cookies before file download request\n\n";
        print $http->cookie_jar->as_string() . "\n";
        my $get_file = $http->request($file_req);
    
        # check request status
        if ( $get_file->is_success ) {
            print "The response from server is " . $get_file->status_line . "\n\n";
            print "The headers in the response are " . $get_file->headers()->as_string() . "\n\n";
            print "Downloaded $url, saving it to file ...\n\n";
            open my $fh, '>', 'tmp_NYSE.txt' or die "ERROR: $!\n";
            print $fh $get_file->decoded_content;
            close $fh;
        } else {
            print "File Download failure\n";
        }
    } else {
        print "Login Error\n";
    }
    

Script output:

    The response from server is 200 OK
    
    The headers in the response are 
    Cache-Control: private
    Date: Sun, 12 Oct 2014 17:43:47 GMT
    Server: Microsoft-IIS/7.5
    Content-Length: 39356
    Content-Type: text/html; charset=utf-8
    Client-Date: Sun, 12 Oct 2014 17:43:48 GMT
    Client-Peer: 64.182.238.14:80
    Client-Response-Num: 1
    Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
    Link: <styles/main.css>; rel="stylesheet"; type="text/css"
    Link: <styles/button.css>; rel="stylesheet"; type="text/css"
    Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
    Link: </styles/colorbox.css>; rel="stylesheet"; type="text/css"
    Link: </styles/slides.css>; rel="stylesheet"; type="text/css"
    Set-Cookie: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path=/; HttpOnly
    Title: End of Day Stock Quote Data and Historical Stock Prices
    X-AspNet-Version: 4.0.30319
    X-Meta-Description: Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, SGX, HKEX, and FOREX.
    X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,historical share prices,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price,stock futures
    X-Meta-Verify-V1: cT9ZK5uSlR3GrcasqgUh7Yh3fnuRGsRY1IRvE85ffa0=
    X-Powered-By: ASP.NET
    
    
    Logged in Successfully
    
    Printing cookies after successful login
    
    Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
    
    Now trying to download http://www.eoddata.com/Data/symbollist.aspx?e=NYSE
    
    Printing cookies before file download request
    
    Set-Cookie3: ASP.NET_SessionId=cjgm4oscl1xmlzwnzql4gcns; path="/"; domain=www.eoddata.com; path_spec; discard; HttpOnly; version=0
    
    The response from server is 200 OK
    
    The headers in the response are Cache-Control: private
    Date: Sun, 12 Oct 2014 17:43:48 GMT
    Server: Microsoft-IIS/7.5
    Content-Length: 49880
    Content-Type: text/html; charset=utf-8
    Client-Date: Sun, 12 Oct 2014 17:43:49 GMT
    Client-Peer: 64.182.238.14:80
    Client-Response-Num: 1
    Link: <styles/jquery-ui-1.10.0.custom.min.css>; rel="stylesheet"; type="text/css"
    Link: <styles/main.css>; rel="stylesheet"; type="text/css"
    Link: <styles/button.css>; rel="stylesheet"; type="text/css"
    Link: <styles/nav.css>; rel="stylesheet"; type="text/css"
    Title: Member Registration
    X-AspNet-Version: 4.0.30319
    X-Meta-Description: Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX.
    X-Meta-Keywords: metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price
    X-Powered-By: ASP.NET
    
    
    Downloaded http://www.eoddata.com/Data/symbollist.aspx?e=NYSE, saving it to file ...
    

The headers from the browser are:

    http://www.eoddata.com/myaccount/default.aspx
    
    GET /Data/symbollist.aspx?e=NYSE HTTP/1.1
    Host: www.eoddata.com
    User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:32.0) Gecko/20100101 Firefox/32.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    Accept-Encoding: gzip, deflate
    Cookie: ASP.NET_SessionId=uvnqhzpzco1wpe300egm4hqj; __utma=264658075.1162754774.1412987203.1413069850.1413137050.4; __utmc=264658075; __utmz=264658075.1412987203.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _cb_ls=1; _chartbeat2=DMtSRyBOnGNFDptR86.1412466246942.1413137060190.10011111; _chartbeat_uuniq=3; EODDataAdmin=D838F9AA985E247A47493320CC8DC14950FA6CE49C6E1079DCFA95F632CEA7A2A6A691B352C544D41D0C208077D0C23897C9EA6EF0FE9221833A7131C334A657A48F5001BF2EBDE073D98BE4FD5719943AAC94D7C3DAA5A422FD575C663C337C93D5046AF3F7987998EDD60347531460FC54DEC81394352D9EDA00B7C954CC3304BC7D4C30D1F3A82C0EE58B890E0765; __utmb=264658075.2.10.1413137050; __utmt=1
    Connection: keep-alive
    
    HTTP/1.1 200 OK
    Cache-Control: private
    Transfer-Encoding: chunked
    Content-Type: text/plain; charset=utf-8
    Server: Microsoft-IIS/7.5
    Content-Disposition: attachment;filename=NYSE.txt
    X-AspNet-Version: 4.0.30319
    X-Powered-By: ASP.NET
    Date: Sun, 12 Oct 2014 18:05:24 GMT
    

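The snippet of the downloaded file below is not the output I want. Note that the title is "Member Registration" rather than the data file I am expecting: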

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><link rel="stylesheet" href="styles/jquery-ui-1.10.0.custom.min.css" type="text/css" /><link rel="stylesheet" href="styles/main.css" type="text/css" /><link rel="stylesheet" href="styles/button.css" type="text/css" /><link rel="stylesheet" href="styles/nav.css" type="text/css" />
    <script src="../scripts/jquery-1.9.0.min.js" type="text/javascript"></script>
    <script src="../scripts/jquery-ui-1.10.0.custom.min.js" type="text/javascript"></script>
    <script type="text/javascript">     var _sf_startpt = (new Date()).getTime()</script>
    <meta name="keywords" content="metastock eod,free eod,free eod data,eod download,stock,exchange,data,historical stock quotes,free,download,day,end,prices,market,chart,NYSE,NASDAQ,AMEX,FTSE,FOREX,ASX,SGX,NZSE,tsx stock,stock share prices,stock ticker symbol,daily prices,daily stock,historic stock price" />
    <meta name="description" content="Register now for Free end of day stock market data and historical quotes for many of the world's top exchanges including NASDAQ, NYSE, AMEX, TSX, OTCBB, FTSE, ASX, SGX, HKEX, and FOREX." />
    <title>
    Member Registration
    </title></head>
    

3 Answers:

Answer 0 (score: 1)

Most of the use statements are unnecessary, since LWP will normally load whatever modules it needs.

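If you are using LWP::UserAgent then you certainly don't need LWP::Simple or WWW::Mechanize, and LWP::UserAgent will create an in-memory HTTP::Cookies object for you if you pass cookie_jar => {} to its constructor.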
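As a minimal sketch of the in-memory cookie jar mentioned above: to the best of my knowledge, LWP::UserAgent only stores cookies once a jar has been set, and passing an empty hash reference asks it to build an HTTP::Cookies object without any file on disk. The variable name $ua is just for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# An empty hash reference asks LWP::UserAgent to construct an
# HTTP::Cookies object itself; nothing is saved to disk.
my $ua = LWP::UserAgent->new( cookie_jar => {} );

# Show which class is holding the cookies.
print ref( $ua->cookie_jar ), "\n";
```

With this in place, the explicit use HTTP::Cookies line and the cookies1.txt file are only needed if the session has to survive between runs of the script.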

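The problem is most likely that the HTML you fetch from the website contains JavaScript code that modifies the page after it has been retrieved. LWP won't emulate that for you, so the page stays exactly as it was sent from the website.

There is no good solution, but WWW::Mechanize::Firefox lets you drive an installed Firefox browser from Perl code, and may do what you need.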

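Answer 1 (score: 1)

Your login code isn't logging you in: the data you are posting does not match the inputs of the login form.

Using WWW::Mechanize's mech-dump to examine the form at http://www.eoddata.com/default.aspx shows the following: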

POST http://www.eoddata.com/default.aspx [aspnetForm]
  ctl00_tsm_HiddenField=         (hidden readonly)
  __VIEWSTATE=/wEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl/Op1ToVKf (hidden readonly)
  __VIEWSTATEGENERATOR=CA0B0334  (hidden readonly)
  __PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81 (hidden readonly)
  __EVENTVALIDATION=/wEdAAvsaJw1zF2h8PWbp8tJHjaFx+CzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek/A8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby (hidden readonly)
  ctl00$Menu1$s1$txtSearch=      (text)
  ctl00$Menu1$s1$btnSearch=Search (submit)
  ctl00$cph1$btns1=CLICK HERE    (submit)
  ctl00$cph1$btns2=CLICK HERE    (submit)
  ctl00$cph1$btns3=CLICK HERE    (submit)
  ctl00$cph1$lg1$txtEmail=       (text)
  ctl00$cph1$lg1$txtPassword=    (password)
  ctl00$cph1$lg1$chkRemember=<UNDEF> (checkbox) [*<UNDEF>/off|on]
  ctl00$cph1$lg1$btnLogin=Login  (submit)

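Your POST request needs to set the appropriate fields from the form above in order to log in to the server successfully, unless there is documentation somewhere stating that the method you are using to log in is valid (I didn't search the site to check this).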
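One way to set those fields, sketched here with HTML::Form on a trimmed-down, hypothetical copy of the login form (the hidden value is a placeholder, not the real __VIEWSTATE), is to let the form object build the request so that the hidden inputs are carried along automatically:

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical, shortened stand-in for the real login page.
my $html = <<'HTML';
<form id="aspnetForm" method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="placeholder-viewstate" />
  <input type="text" name="ctl00$cph1$lg1$txtEmail" />
  <input type="password" name="ctl00$cph1$lg1$txtPassword" />
  <input type="submit" name="ctl00$cph1$lg1$btnLogin" value="Login" />
</form>
HTML

# In scalar context parse() returns the first form in the document.
my $form = HTML::Form->parse( $html, 'http://www.eoddata.com/default.aspx' );

# Fill in the real field names; hidden fields keep their parsed values.
$form->value( 'ctl00$cph1$lg1$txtEmail',    'xcytt' );
$form->value( 'ctl00$cph1$lg1$txtPassword', '321pass' );

# click() returns an HTTP::Request as if the named button were pressed.
my $request = $form->click('ctl00$cph1$lg1$btnLogin');

print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

In a real script the HTML would come from a GET of the login page, and the resulting request could be sent with $http->request($request).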

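I cheated a little and created a valid login request using data from the Chrome browser panel (rather than using WWW::Mechanize to fill in the form or building the request myself). With this, I was able to log in and download the file: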

my $resp = $http->post(
    'http://www.eoddata.com/default.aspx',
    Content => 'ctl00_tsm_HiddenField=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=%2FwEPDwUJNTgzMTIzMjMyD2QWAmYPZBYCAgMPZBYCAgcPZBYCAh0PZBYEAgMPZBYCAgcPDxYCHgRUZXh0ZWRkAgcPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBRpjdGwwMCRjcGgxJGxnMSRjaGtSZW1lbWJlcuq72b0jSSSEoSOAcZlLZzWMmsYqjOMTbPl%2FOp1ToVKf&__VIEWSTATEGENERATOR=CA0B0334&__PREVIOUSPAGE=72Ep8BrmYqNbOSb65afxljULshovHpRLBJcMC0funBrM2g0qkkpORQb_wqNsu_2SbA5JbxbwNkpXlR_SZWwgPwwbGdBP4YGDoNJCDtPRQS81&__EVENTVALIDATION=%2FwEdAAvsaJw1zF2h8PWbp8tJHjaFx%2BCzKn9gssNaJswg1PWksJd223BvmKj73tdq9M98Zo0JWPh42opnSCw9zAHys7YwDyn98qMl4Da8RNKOYtjmMtj1Nek%2FA8Dky1WNDflwB7GO1vgbcIR7aON1c4Cm5wJw0r2yvex8d7TohORX6QMo1j8IRvmRE3IYRPV0S4fj4csX1838LMsOJxqMoksh8zNIRuOmXf1pY8AyXSwvWgp1mYRx4mHFI6oep3qpPKhhA22Mc6tB5KOFIqkGgyvucIby&ctl00%24Menu1%24s1%24txtSearch=&ctl00%24cph1%24lg1%24txtEmail=xcytt&ctl00%24cph1%24lg1%24txtPassword=321pass&ctl00%24cph1%24lg1%24btnLogin=Login' );

if ($resp->is_success) {    
    my $get_file = $http->get("http://www.eoddata.com/Data/symbollist.aspx?e=NYSE");
}

Dumping the contents of $get_file gives me a list of symbols and company names, as expected.

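You can either use WWW::Mechanize to fill in the form fields, or scrape the form input values from http://www.eoddata.com/default.aspx (particularly the hidden fields, which change on every load) and then create a POST request using those values together with your login credentials.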
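The scraping approach could look roughly like the following sketch. It assumes HTML::Form is used for the parsing and works on a hypothetical inline copy of the form with placeholder hidden values; in practice the HTML would be the freshly fetched login page.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for a freshly fetched login page.
my $html = <<'HTML';
<form id="aspnetForm" method="post" action="default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="placeholder-viewstate" />
  <input type="hidden" name="__EVENTVALIDATION" value="placeholder-validation" />
  <input type="text" name="ctl00$cph1$lg1$txtEmail" />
  <input type="password" name="ctl00$cph1$lg1$txtPassword" />
  <input type="submit" name="ctl00$cph1$lg1$btnLogin" value="Login" />
</form>
HTML

my $form = HTML::Form->parse( $html, 'http://www.eoddata.com/default.aspx' );

# Copy every hidden input unchanged, since the server expects them back.
my @fields = map  { ( $_->name => $_->value ) }
             grep { $_->type eq 'hidden' } $form->inputs;

# Append the credentials and the login button under their real names.
push @fields,
    'ctl00$cph1$lg1$txtEmail'    => 'xcytt',
    'ctl00$cph1$lg1$txtPassword' => '321pass',
    'ctl00$cph1$lg1$btnLogin'    => 'Login';

print scalar(@fields) / 2, " fields ready to post\n";
```

The resulting list can be handed to LWP::UserAgent as $http->post( $url, \@fields ), which takes care of the URL encoding.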

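Also note that it is entirely possible to get a successful response from the server without it having done what you wanted (e.g. logging you in). Redirects, and pages saying "login failed", are treated as successes by LWP::UserAgent.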
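A small sketch of checking for more than is_success follows. It relies on the headers shown in the question: the registration page came back as text/html, while the browser's real download was text/plain with a Content-Disposition header. The helper name looks_like_data_file and the response bodies are made up for illustration, and the responses are built by hand so the check can be tried without touching the network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: a 200 response only counts as the data file
# if the server did not send back an HTML page.
sub looks_like_data_file {
    my ($resp) = @_;
    return 0 unless $resp->is_success;
    return 0 if $resp->content_type eq 'text/html';
    return 1;
}

# Hand-built responses mirroring the two sets of headers in the question.
my $registration = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    '<title>Member Registration</title>',
);
my $download = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'        => 'text/plain; charset=utf-8',
        'Content-Disposition' => 'attachment;filename=NYSE.txt',
    ],
    "placeholder file contents\n",
);

for my $resp ( $registration, $download ) {
    print $resp->content_type, ': ',
        ( looks_like_data_file($resp) ? 'data file' : 'not the data file' ), "\n";
}
```

Both responses report is_success, yet only the second one passes the check.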

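Answer 2 (score: 0)

In case anyone is still interested in this problem, I took another look and found that it can be done with LWP alone. However, the facilities of WWW::Mechanize make working with HTML forms much simpler.

This is a program that logs in to the page using the credentials provided. Being an ASP page, it has horrible input names. For example, the username and password fields and the login button are named ctl00$cph1$lg1$txtEmail, ctl00$cph1$lg1$txtPassword, and ctl00$cph1$lg1$btnLogin respectively. I have used HTML::Form methods directly, with regular expressions to locate these input fields, which I think makes the code clearer.

I have displayed the title of the HTML page reached after logging in, to show that it is working.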

use strict;
use warnings;

use WWW::Mechanize;

my $base_url = 'http://www.eoddata.com/default.aspx';
my $username = 'xcytt';
my $password = '321pass';

my $mech = WWW::Mechanize->new;

$mech->get($base_url);

my $form = $mech->form_id('aspnetForm');

my @inputs  = $form->inputs;
my ($email) = grep $_->name =~ /Email/,    @inputs;
my ($pass)  = grep $_->name =~ /Password/, @inputs;
my ($login) = grep $_->name =~ /Login/,    @inputs;

$email->value($username);
$pass->value($password);
$mech->click_button(value => 'Login');

print $mech->title, "\n";

Output

EODData - My Download