Trying to scrape JSON data from http://www.firstgiving.com using LWP::UserAgent

Time: 2011-12-13 21:36:55

Tags: json perl lwp-useragent

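As some of you may have heard, there is some charity fundraising going on at the moment, particularly on r/atheism. To help and encourage the fundraising, I started writing a small web utility that provides real-time information about these donations (basically a mashup of data from Reddit and data from FirstGiving). You can see what I have so far here. It only shows the total and average figures for each subreddit, and it is very preliminary (and not very pretty).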

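The feature I want to add is one that FirstGiving doesn't seem to offer: the ability to search for, or link to, a specific donation. Over the last week there have been many posts where people offered donation matching and the like, but there have also been many fake/troll posts, and there is no good way to verify whether someone has actually "delivered" (we all know screenshots are easy to fake). I plan to cache the data from FirstGiving to allow someone to link to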

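After inspecting the FirstGiving page, I found what appears to be an undocumented JSON API call (used when you scroll to the bottom of the page to show more donations). It returns a list of donation amounts, messages, and nicknames as an HTML table. According to Opera Dragonfly, this is what the call looks like when made from my browser (Opera):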

URL:    http://www.firstgiving.com/ProfileWebApi/Donations
Method: POST
Status: 200 OK
Duration:   1220 ms

Request details

POST /ProfileWebApi/Donations HTTP/1.1 
User-Agent: Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60
Host: www.firstgiving.com
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate
Referer: http://www.firstgiving.com/fundraiser/r-atheism/ratheism
Cookie: ASP.NET_SessionId=rmsl4b45jdxwykanpoqkb255
Connection: Keep-Alive
Content-Length: 111
Content-Type: application/json;
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Content-Transfer-Encoding: binary
Request body
{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":false,"PageNumber":4,"PageSize":50}
Response details
HTTP/1.1 200 OK 
Cache-Control: private
Content-Length: 62979
Content-Type: application/json; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNetMvc-Version: 2.0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Tue, 13 Dec 2011 19:13:28 GMT

Body

{"Data":"\u0009\u000d\u000a\u0009\u0009\u0009\u0009\u000d\u000a                         <table class=\"donationTable collapsed\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style='height:0px; overflow:hidden;' >\u000d\u000a                            <thead class=\"visuallyhidden\">\u000d\u000a\u0009\u0009                        <tr>\u000d\u000a                                    <th scope=\"col\">Comment<\/th>\u000d\u000a                                    <th scope=\"col\" class=\"amount\">Donation<\/th>\u000d\u000a                                <\/tr>\u000d\u000a                            <\/thead>\u000d\u000a\u0009\u0009\u0009            \u000d\u000a                            <tr>                              \u000d\u000a                                  <td class=\"comment\">\u000d\u000a                                            \u000d\u000a                                                    <strong>Dear Regan Layman<\/strong>\u000d\u000a                                                Happy holidays :)<br \/>\u000d\u000a                                            \u000d\u000a                                                <time datetime=\"2011-12-10T21:55:35.0000000\">\u000d\u000a                                                    12\/10\/2011\u000d\u000a                                                <\/time>\u000d\u000a                                            \u000d\u000a                                   <\/td>\u000d\u000a                               \u000d\u000a                              <td class=\"amount\">\u000d\u000a                                $20.00<sup style=\"font-size:10px;\" title=\"Offline donation\"><\/sup> \u000d\u000a                                \u000d\u000a                              <\/td>\u000d\u000a                        <\/tr>\u000d\u000a\u0009                \u000d\u000a                            <tr>                              \u000d\u000a                                  <td class=\"comment\">\u000d\u000a  
                                          \u000d\u000a                                                    <strong>Frodo Baggins<\/strong>\u000d\u000a                                                Due to the fact that doctors heal people, not God!<br \/>\u000d\u000a                                            \u000d\u000a                                                <time datetime=\"2011-12-10T21:52:11.0000000\">\u000d\u000a                                                    12\/10\/2011\u000d\u000a                                                <\/time>\u000d\u000a                                            \u000d\u000a                                   <\/td>\u000d\u000a                               \u000d\u000a                              <td class=\"amount\">\u000d\u000a                                $4.64<sup style=\"font-size:10px;\" title=\"Offline donation\"><\/sup> \u000d\u000a                                \u000d\u000a                              <\/td>\u000d\u000a                        <\/tr>\u000d\u000a\u0009                \u000d\u000a                            

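(I snipped the rest of the response body. Also, there are normally more cookies, but I manually removed all of them except the ASP.NET session ID and the request still worked, so the others seem to be used only for analytics and the like.)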
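For reference, here is a minimal sketch of how a response of this shape could be unpacked in Perl: decode the JSON, take the HTML from the `Data` key, and pull out each name and amount. The sample string is a shortened excerpt of the response above, the helper name `parse_donations` is hypothetical, and the regex assumes every row has a `<strong>` name followed by a dollar amount. A real HTML parser would be more robust than a regex here.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; decode_json expects UTF-8 encoded bytes

# Hypothetical helper: returns a list of { name, amount } hashes
# extracted from the HTML table stored under the "Data" key.
sub parse_donations {
    my ($json_text) = @_;
    my $html = decode_json($json_text)->{Data};

    my @donations;
    # Assumes each row has <strong>name</strong> in the comment cell
    # and "$amount" at the start of the amount cell.
    while ($html =~ m{<td class="comment">\s*<strong>(.*?)</strong>.*?<td class="amount">\s*\$([\d,.]+)}gs) {
        push @donations, { name => $1, amount => $2 };
    }
    return @donations;
}

# Shortened sample in the same shape as the response shown above.
my $sample = <<'END_JSON';
{"Data":"<tr><td class=\"comment\"><strong>Frodo Baggins<\/strong> Due to the fact that doctors heal people, not God!<br \/><\/td><td class=\"amount\">$4.64<sup><\/sup><\/td><\/tr>"}
END_JSON

for my $donation (parse_donations($sample)) {
    print "$donation->{name}: $donation->{amount}\n";
}
```

With a live response, the raw body from `$response->content` would be passed in instead of `$sample`.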

However, when I try to do the same thing from a Perl script, I don't get this useful output. Here is my script:

#!/usr/bin/perl -w

use LWP::Simple;
use JSON;

use HTTP::Cookies;
use LWP::UserAgent;

use Data::Dumper;

my $cookie_jar = HTTP::Cookies->new;
my $ua = LWP::UserAgent->new(cookie_jar => $cookie_jar);
#push @{ $ua->requests_redirectable }, 'POST';
$ua->get('http://www.firstgiving.com/fundraiser/r-atheism/ratheism');

print Dumper $cookie_jar;

my $req = HTTP::Request->new(
    'POST',
    'http://www.firstgiving.com/ProfileWebApi/Donations');
$req->header('Accept-Encoding' => 'gzip, deflate');
$req->header('Referer' => 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism');
$req->header('X-Requested-With' => 'XMLHttpRequest');
$req->header('Content-Transfer-Encoding' => 'binary');
$req->header('Content-type:' => 'application/json');
$req->header('User-Agent' => 'Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60');
$req->content('{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":true,"PageNumber":2,"PageSize":50}');
#$req->content('{"EventGivingGroupId":1476950,"PageNumber":1,"PageSize":50}');

my $post_request = $ua->request($req);
print Dumper( ($post_request) );

Here is the output:

$VAR1 = bless( {
                 'COOKIES' => {
                                'www.firstgiving.com' => {
                                                           '/' => {
                                                                    'ASP.NET_SessionId' => [
                                                                                             0,
                                                                                             'yynhqi2udtz4y055fakdvjiu',
                                                                                             undef,
                                                                                             1,
                                                                                             undef,
                                                                                             undef,
                                                                                             1,
                                                                                             {
                                                                                               'HttpOnly' => undef
                                                                                             }
                                                                                           ]
                                                                  }
                                                         }
                              }
               }, 'HTTP::Cookies' );
$VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_content' => '<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="%2ferror%2f404">here</a>.</h2>
</body></html>
',
                 '_rc' => '302',
                 '_headers' => bless( {
                                        'x-powered-by' => 'ASP.NET',
                                        'client-response-num' => 1,
                                        'location' => '/error/404',
                                        'cache-control' => 'private',
                                        'date' => 'Tue, 13 Dec 2011 19:43:56 GMT',
                                        'client-peer' => '204.12.127.197:80',
                                        'x-aspnet-version' => '2.0.50727',
                                        'client-date' => 'Tue, 13 Dec 2011 19:36:45 GMT',
                                        'x-aspnetmvc-version' => '2.0',
                                        'content-type' => 'text/html; charset=utf-8',
                                        'title' => 'Object moved',
                                        'client-transfer-encoding' => [
                                                                        'chunked'
                                                                      ],
                                        'server' => 'Microsoft-IIS/7.5'
                                      }, 'HTTP::Headers' ),
                 '_msg' => 'Found',
                 '_request' => bless( {
                                        '_content' => '{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":true,"PageNumber":2,"PageSize":50}',
                                        '_uri' => bless( do{\(my $o = 'http://www.firstgiving.com/ProfileWebApi/Donations')}, 'URI::http' ),
                                        '_headers' => bless( {
                                                               'cookie2' => '$Version="1"',
                                                               'user-agent' => 'Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60',
                                                               'cookie' => 'ASP.NET_SessionId=yynhqi2udtz4y055fakdvjiu',
                                                               'x-requested-with' => 'XMLHttpRequest',
                                                               'accept-encoding' => 'gzip, deflate',
                                                               'content-transfer-encoding' => 'binary',
                                                               'content-type:' => 'application/json',
                                                               'referer' => 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism'
                                                             }, 'HTTP::Headers' ),
                                        '_method' => 'POST',
                                        '_uri_canonical' => $VAR1->{'_request'}{'_uri'}
                                      }, 'HTTP::Request' )
               }, 'HTTP::Response' );

If I enable the line push @{ $ua->requests_redirectable }, 'POST'; (i.e. allow POST requests to follow redirects), I am redirected to the 404 error page.

If this is a deliberate attempt by FirstGiving to block non-human clients, I will of course give up, but their robots.txt doesn't seem to forbid what I'm doing.

1 Answer:

Answer 0: (score: 2)

Add the Accept: application/json, text/javascript, */*; q=0.01 header. It's not a header I would normally consider critical, but in this case it seems to be.

I did a quick little test using curl. This works:

curl -vv -H 'Content-Type: application/json' \
  -H 'Referer: http://www.firstgiving.com/fundraiser/r-atheism/ratheism' \
  -H 'Cookie: ASP.NET_SessionId=svqlde45h0cvrv55hqvhwv55;' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Accept: application/json, text/javascript, */*; q=0.01' \
  -d '{"EventGivingGroupId":1476950,"TotalRaised":"191532.480000","PageIsExpired":false,"PageNumber":2,"PageSize":50}' \
  'http://www.firstgiving.com/ProfileWebApi/Donations'

This gives me the redirect:

curl -vv -H 'Content-Type: application/json' \
  -H 'Referer: http://www.firstgiving.com/fundraiser/r-atheism/ratheism' \
  -H 'Cookie: ASP.NET_SessionId=svqlde45h0cvrv55hqvhwv55;' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -d '{"EventGivingGroupId":1476950,"TotalRaised":"191532.480000","PageIsExpired":false,"PageNumber":2,"PageSize":50}' \
  'http://www.firstgiving.com/ProfileWebApi/Donations'
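Carried back to the Perl script, the same fix might look like the sketch below. It only builds the request and does not send it. The helper name `build_donations_request` is hypothetical, and it is an assumption that the server accepts the body keys in any order. Besides adding `Accept`, the sketch spells the content type header as `Content-Type`; the script in the question sets `'Content-type:'` with a trailing colon, which shows up in the dump as a header named `content-type:` rather than a real content type.

```perl
use strict;
use warnings;
use HTTP::Request;
use JSON::PP;

# Hypothetical helper: build (but do not send) the POST for one page of donations.
sub build_donations_request {
    my ($page_number, $total_raised) = @_;

    # canonical() sorts the keys so the encoded body is deterministic.
    my $body = JSON::PP->new->canonical->encode({
        EventGivingGroupId => 1476950,
        TotalRaised        => "$total_raised",      # sent as a string, as the browser does
        PageIsExpired      => JSON::PP::false(),
        PageNumber         => $page_number + 0,     # force a JSON number
        PageSize           => 50,
    });

    my $req = HTTP::Request->new(
        POST => 'http://www.firstgiving.com/ProfileWebApi/Donations');
    # The header the answer identifies as the missing piece.
    $req->header('Accept' => 'application/json, text/javascript, */*; q=0.01');
    # Header name without a trailing colon.
    $req->header('Content-Type' => 'application/json');
    $req->header('Referer' => 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism');
    $req->header('X-Requested-With' => 'XMLHttpRequest');
    $req->content($body);
    return $req;
}

# Print the request so the headers and body can be compared with the curl call.
print build_donations_request(2, '190776.020000')->as_string;
```

To send it, pass the result to `$ua->request(...)` on the cookie-jar-equipped `$ua` from the question, after the initial `get` that picks up the `ASP.NET_SessionId` cookie.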