How to get the headers of an HTML page

Asked: 2012-02-15 18:34:40

Tags: html perl http-headers perl-module

I am using HTTP::Request to get the headers of an XML feed, but when I print the header fields, it prints empty strings.

    use HTTP::Request;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new( GET => 'http://www.gmail.com' );

    print "Requesting...\n";
    my $response = $ua->request( $request );
    print "  Status: ", $response->status_line, "\n";
    print "  Last modified: ", $response->header( 'last-modified' ), "\n";
    print "  Etag: ", $response->header( 'etag' ), "\n\n";

Is there any way to find out when a web page was last modified if it does not send headers such as "Last-Modified" and "ETag"?

1 answer:

Answer 0 (score: 1)

This particular site does not seem to return the etag or last-modified header keys.
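A missing header comes back as undef from `header()`, which is why the prints in the question show empty strings. A minimal sketch of checking for that follows; it builds an HTTP::Response by hand so it runs without network access, the header values are illustrative, and `describe_header` is a hypothetical helper name, not part of LWP.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for what $ua->request() returns;
# the header values here are illustrative, not fetched live.
my $response = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'  => 'text/html; charset=UTF-8',
        'Cache-Control' => 'no-cache, no-store',
    ],
);

# Hypothetical helper: returns the header value, or a marker string
# when the response does not carry that header at all.
sub describe_header {
    my ( $res, $name ) = @_;
    my $value = $res->header($name);    # undef if the header is absent
    return defined $value ? $value : '(not sent)';
}

for my $name (qw( Content-Type Last-Modified ETag )) {
    print "  $name: ", describe_header( $response, $name ), "\n";
}
```

With a live response, the same helper can be called on the object returned by `$ua->request( $request )`.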

You can use dump to get everything:

    my $request = HTTP::Request->new( GET => 'http://www.gmail.com' );
    my $response = $ua->request( $request );
    print $response->dump()."\n";

You get:

    HTTP/1.1 200 OK
    Cache-Control: no-cache, no-store
    Connection: close
    Date: Wed, 15 Feb 2012 18:55:54 GMT
    Pragma: no-cache
    Server: GSE
    Content-Type: text/html; charset=UTF-8
    Expires: Mon, 01-Jan-1990 00:00:00 GMT
    Client-Date: Wed, 15 Feb 2012 18:55:53 GMT
    Client-Peer: 173.194.67.84:443
    Client-Response-Num: 1
    Client-SSL-Cert-Issuer: /C=ZA/O=Thawte Consulting (Pty) Ltd./CN=Thawte SGC CA
    Client-SSL-Cert-Subject: /C=US/ST=California/L=Mountain View/O=Google Inc/CN=accounts.google.com
    Client-SSL-Cipher: RC4-SHA
    Client-SSL-Socket-Class: IO::Socket::SSL
    Client-Transfer-Encoding: chunked
    Link: <//mail.google.com/favicon.ico>; rel="icon"; type="image/ico"
    Link: <https://plus.google.com/103345707817934461425>; rel="publisher"
    Set-Cookie: GAPS=1:P5tYXkr9cvVAMBJZ_j8lm34_tvxOWQ:Wt3iF5PQ_mn8YVOj;Path=/;Expires=Fri, 14-Feb-2014 18:55:54 GMT;Secure;HttpOnly
    Set-Cookie: GALX=N9ISXky4Eu8;Path=/;Secure
    Strict-Transport-Security: max-age=2592000; includeSubDomains
    Title: Gmail: Email from Google
    X-Auto-Login: realm=com.google&args=service%3Dmail%26continue%3Dhttp%253A%252F%252Fmail.google.com%252Fmail%252F
    X-Content-Type-Options: nosniff
    X-Frame-Options: Deny
    X-Meta-Charset: utf-8
    X-Meta-Description: 7+ GB of storage, less spam, and mobile access.  is email that's intuitive, efficient, and useful. And maybe even fun.
    X-XSS-Protection: 1; mode=block

    <!DOCTYPE html>
    <html lang="en">
      <head>
      <meta charset="utf-8">
      <title>Gmail: Email from Google</title>
      <meta name="description" content="7+ GB of storage, less spam, and mobile access.  is email that&#39;s intuitive, efficient, and useful. And maybe even fun.">
      <link rel="icon" type="image/ico" href="//mail.google.com/favicon.ico">
    <style type="text/css">
      html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl,
      dt, dd, ol, ul, li, table, tr, td, form, object, embed,
      article, aside, canvas, comma...
    (+ 54500 more bytes not shown)

...and there is no etag and no last-modified.
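If the full dump is too noisy, listing only the header names may be enough to see what a response carries. The sketch below uses `header_field_names` on a hand-built HTTP::Response (a small, illustrative subset of the dump above, so no network access is needed) and separates out the `Client-` fields, which LWP adds on the client side and which the server did not send.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for a fetched response; the values are a small
# subset of the dump shown in the answer.
my $response = HTTP::Response->new(
    200, 'OK',
    [
        'Date'        => 'Wed, 15 Feb 2012 18:55:54 GMT',
        'Server'      => 'GSE',
        'Client-Date' => 'Wed, 15 Feb 2012 18:55:53 GMT',
    ],
);

# header_field_names returns each distinct field name once.
my @names = $response->header_field_names;

# Split the names: "Client-" fields were added locally by LWP,
# everything else is treated here as coming from the response itself.
my @client_side = grep {  /^Client-/i } @names;
my @remaining   = grep { !/^Client-/i } @names;

print "Added by LWP: @client_side\n";
print "Other fields: @remaining\n";
```

Neither list contains Last-Modified or ETag for this sample, which matches what the dump shows.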