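Net/HTTPS doesn't fetch the full page content

Date: 2015-05-05 18:45:20

Tags: ruby web-crawler nokogiri net-http mechanize-ruby

I need to log in to Jenkins with a crawler to collect some data, but the page that Net/HTTPS returns is incomplete compared with the source Jenkins serves in the browser. Here are both:

HTML returned by Net/HTTPS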

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta http-equiv="refresh" content="1;url=/login?from=%2F">
  <script>
    window.location.replace('/login?from=%2F');
  </script>
</head>

<body style="background-color:white; color:white;">Authentication required</body>

</html>

Nokogiri's XML

=> #<Nokogiri::HTML::Document:0x1a11444 name="document" children=[#<Nokogiri::XML::DTD:0x1a109b8 name="html">, #<Nokogiri::XML::Element:0x1a101ac name="html" children=[#<Nokogiri::XML::Element:0x2047ee4 name="head" children=[#<Nokogiri::XML::Element:0x2047d04 name="meta" attributes=[#<Nokogiri::XML::Attr:0x2047ca0 name="http-equiv" value="refresh">, #<Nokogiri::XML::Attr:0x2047c8c name="content" value="1;url=/login?from=%2F">]>, #<Nokogiri::XML::Element:0x2047660 name="script" children=[#<Nokogiri::XML::CDATA:0x2047480 "window.location.replace('/login?from=%2F');">]>]>, #<Nokogiri::XML::Element:0x20471ec name="body" attributes=[#<Nokogiri::XML::Attr:0x2047188 name="style" value="background-color:white; color:white;">] children=[#<Nokogiri::XML::Text:0x2046d50 "Authentication required">]>]>]>

Jenkins' source

<!DOCTYPE html>
<html>

<head resURL="/static/98ff49d3">


  <title>Jenkins</title>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/style.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/color.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/responsive-grid.css" />
  <link rel="shortcut icon" type="image/vnd.microsoft.icon" href="/static/98ff49d3/favicon.ico" />
  <script>
    var isRunAsTest = false;
    var rootURL = "";
    var resURL = "/static/98ff49d3";
  </script>
  <script src="/static/98ff49d3/scripts/prototype.js" type="text/javascript"></script>
  <script src="/static/98ff49d3/scripts/behavior.js" type="text/javascript"></script>
  <script src='/adjuncts/98ff49d3/org/kohsuke/stapler/bind.js' type='text/javascript'></script>
  <script src="/static/98ff49d3/scripts/yui/yahoo/yahoo-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/dom/dom-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/event/event-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/animation/animation-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/dragdrop/dragdrop-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/container/container-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/connection/connection-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/datasource/datasource-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/autocomplete/autocomplete-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/menu/menu-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/element/element-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/button/button-min.js"></script>
  <script src="/static/98ff49d3/scripts/yui/storage/storage-min.js"></script>
  <script src="/static/98ff49d3/scripts/hudson-behavior.js" type="text/javascript"></script>
  <script src="/static/98ff49d3/scripts/sortable.js" type="text/javascript"></script>
  <script>
    crumb.init("", "");
  </script>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/container.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/assets/skins/sam/skin.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/skins/sam/container.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/button/assets/skins/sam/button.css" />
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/menu/assets/skins/sam/menu.css" />
  <meta name="ROBOTS" content="INDEX,NOFOLLOW" />
  <script src="/static/98ff49d3/scripts/yui/cookie/cookie-min.js"></script>
  <link rel="stylesheet" type="text/css" href="/static/98ff49d3/plugin/sectioned-view/sectioned-view.css" />
</head>

<body id="jenkins" data-version="jenkins-1.596.1" class="yui-skin-sam jenkins-1.596.1"><a href="#skip2content" class="skiplink">Skip to content</a>
  <div id="page-head">
    <div id="header">
      <div class="logo">
        <a id="jenkins-home-link" href="/">
          <img id="jenkins-head-icon" alt="title" src="/static/98ff49d3/images/headshot.png" />
          <img id="jenkins-name-icon" height="34" alt="title" width="139" src="/static/98ff49d3/images/title.png" />
        </a>
      </div>
      <div class="login"> <a href="/login?from=%2F"><b>log in</b></a>
        |
        <a href="/signup"><b>sign up</b></a>
      </div>
      <div class="searchbox hidden-xs">
        <form style="position:relative;" name="search" action="/search/" class="no-json" method="get">
          <div id="search-box-minWidth"></div>
          <div id="search-box-sizer"></div>
          <div id="searchform">
            <input id="search-box" placeholder="search" name="q" class="has-default-text" />
            <a href="http://wiki.jenkins-ci.org/display/JENKINS/Search+Box">
              <img style="width: 16px; height: 16px; " class="icon-help icon-sm" src="/static/98ff49d3/images/16x16/help.png" />
            </a>
            <div id="search-box-completion"></div>
            <script>
              createSearchBox("/search/");
            </script>
          </div>
        </form>
      </div>
    </div>
    <div id="breadcrumbBar">
      <tr id="top-nav">
        <td id="left-top-nav" colspan="2">
          <link rel='stylesheet' href='/adjuncts/98ff49d3/lib/layout/breadcrumbs.css' type='text/css' />
          <script src='/adjuncts/98ff49d3/lib/layout/breadcrumbs.js' type='text/javascript'></script>
          <div class="top-sticker noedge">
            <div class="top-sticker-inner">
              <div id="right-top-nav"></div>
              <ul id="breadcrumbs">
                <li class="item"><a class="model-link inside" href="/">Jenkins</a>
                </li>
                <li class="children" href="/"></li>
              </ul>
              <div id="breadcrumb-menu-target"></div>
            </div>
          </div>
        </td>
      </tr>
    </div>
  </div>
  <div id="page-body">
    <div class="row">
      <div id="side-panel">
        <div id="side-panel-content"></div>
      </div>
      <div id="main-panel">
        <div id="main-panel-content">
          <a name="skip2content"></a>
          <div style="margin: 2em;">
            <form style="text-size:smaller" name="login" action="j_acegi_security_check" method="post">
              <table>
                <tr>
                  <td>User:</td>
                  <td>
                    <input type="text" name="j_username" id="j_username" />
                  </td>
                </tr>
                <tr>
                  <td>Password:</td>
                  <td>
                    <input type="password" name="j_password" />
                  </td>
                </tr>
                <tr>
                  <td align="right">
                    <input id="remember_me" type="checkbox" name="remember_me" />
                  </td>
                  <td>
                    <label for="remember_me">Remember me on this computer</label>
                  </td>
                </tr>
              </table>
              <input name="from" value="/" type="hidden" />
              <input name="Submit" value="log in" class="submit-button primary" type="submit" />
              <script>
                $('j_username').focus();
              </script>
            </form>
            <div style="margin-top:2em"><a href="signup">Create an account</a> if you are not a member yet.</div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <div id="footer-container" class="hidden-xs">
    <div id="footer"><span class="page_generated">
          Page generated:
          May 5, 2015 1:09:35 PM</span><span class="rest_api"><a href="api/">REST API</a></span><span class="jenkins_ver"><a href="http://jenkins-ci.org/">Jenkins ver. 1.596.1</a></span>
      <div id="l10n-dialog" class="dialog"></div>
      <div id="l10n-footer" style="display:none; float:left">
        <a href="#" onclick="return showTranslationDialog();">
          <img src="/static/98ff49d3/plugin/translation/flags.png" />Help us localize this page
        </a>
      </div>
      <script>
        var footer = document.getElementById('l10n-footer');
        var f = document.getElementById('footer');
        f.insertBefore(footer, f.firstChild);
        footer.style.display = "block";

        var translation = {};
        translation.bundles = "6CPNEARN8E/l4k/4nMQznROeAYoCO7auJUGWM6qMGBK2/ELamFqR7whqOnrQ+pYEU4X6xVw11/3WEM16VclDS66Hi2QY5S41H0NSwFiE07KHND+iP3c2Zb4MiiqIOrGRLMJEPdu/j3QYQ5Yp2rkj/ISZWOGFVY86zs/0JsDEw+VJN9dlaSkRcelDKNfziTE/8K7Sabhhd0we7ATzNTgNrfenUCaCdwR7BqPc7354m+fmVz7/8DpcYBMzl78E3+DpUF6sJa18uD7OkgPMNYz8lIM9Bx1ZXanyOk49M8Sea9qj+teMndv9kiyawWnloiBlg3KdK0DfZs1v+RbCQ/HnYcIcjAZVgKTYD2S0GpSj5oHMFQeTemQRnbj6WMon3u7Z8q3np+0Ucgxcs1LfKqprNmeugoD5jIxCuHhHCQvaHdw=";
        translation.detectedLocale = "";

        function showTranslationDialog() {
          if (!translation.launchDialog)
            loadScript("/static/98ff49d3/plugin/translation/dialog.js");
          else
            translation.launchDialog();
          return false;
        }
      </script>
    </div>
  </div>
</body>

</html>

I need these lines from the Jenkins source so that I can fill in the form and log in:

<input type="text" name="j_username" id="j_username" />
<input type="password" name="j_password" />
<input name="Submit" value="log in" class="submit-button primary" type="submit" />

This is the code I'm running to fetch the data:

require 'rubygems'
require 'nokogiri'
require 'net/https'
require 'openssl'
require 'mechanize'

class JenkinsTest
  # Request the Jenkins webpage
  def request_jenkins_webpage
    uri = URI.parse("https://jenkinspage.com:8443")
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    @@page = Nokogiri::HTML(response.body)
  end

  def print_jenkins_webpage
    puts @@page
  end
end

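Some additional notes: the network goes through a proxy that requires no login or password, and the Jenkins certificate is self-signed.

My question is: why does this happen, and how can I fix it?

Thanks in advance!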

1 answer:

Answer 0 (score: 0)

Thanks to the help of @theTinMan, @MarkThomas and my colleagues, I managed to log in to Jenkins and collect the page's XML using Mechanize and Nokogiri.
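As a rough illustration of that approach (a sketch, not the original poster's script), the code below opens the /login page directly instead of /, fills in the j_username and j_password fields of the form named login, and submits it. The helper names build_agent and jenkins_login, the credentials, and the optional proxy arguments are assumptions made for this example; the host is the one from the question. Certificate verification is switched off only because the question says the certificate is self-signed.

```ruby
# Builds a Mechanize agent for a Jenkins instance with a self-signed
# certificate. The proxy arguments are hypothetical; pass them only if
# the proxy has to be configured explicitly.
def build_agent(proxy_host: nil, proxy_port: nil)
  require 'mechanize' # third-party gem, loaded only when an agent is built
  require 'openssl'

  agent = Mechanize.new
  # Assumption: skipping verification is acceptable for this internal host.
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
  agent.set_proxy(proxy_host, proxy_port) if proxy_host
  agent
end

# Requests the login page, fills in the form named "login" (the field
# names come from the Jenkins source in the question) and submits it.
# Returns whatever page Jenkins sends back after the login.
def jenkins_login(agent, base_url, username, password)
  login_page = agent.get("#{base_url}/login")
  form = login_page.form_with(name: 'login')
  raise 'login form not found' if form.nil?

  form['j_username'] = username
  form['j_password'] = password
  form.submit
end

# Example usage (placeholder credentials):
#   agent = build_agent
#   page  = jenkins_login(agent, 'https://jenkinspage.com:8443', 'user', 'secret')
#   puts page.parser.to_xml
```

Mechanize keeps the session cookie on the agent, so later agent.get calls are made as the logged-in user, and page.parser exposes the Nokogiri document for each page.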

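On why the first request came back nearly empty: the HTML returned for / is only a stub that sends the browser to /login?from=%2F through a meta refresh tag and a window.location.replace call, and Net::HTTP neither follows meta refreshes nor runs JavaScript. The sketch below pulls that target out of such a stub so it can be requested explicitly. The method name meta_refresh_target is hypothetical, and a regular expression is used only to keep the example free of gems; parsing the tag with Nokogiri would be more robust.

```ruby
require 'uri'

# Matches <meta http-equiv="refresh" content="N;url=TARGET"> and captures TARGET.
META_REFRESH = /<meta[^>]*http-equiv=["']refresh["'][^>]*content=["']\s*\d+\s*;\s*url=([^"']+)/i

# Returns the absolute URL that a meta refresh tag in +html+ points to,
# resolved against +base_url+, or nil when the page has no such tag.
def meta_refresh_target(html, base_url)
  match = html.match(META_REFRESH)
  return nil unless match

  URI.join(base_url, match[1]).to_s
end
```

Calling it on the stub from the question with the base URL https://jenkinspage.com:8443 yields the login page address, which is the page that actually contains the j_username and j_password inputs.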