Website allows robots, but my scraper is detected and denied

Time: 2018-11-24 10:45:02

Tags: ruby-on-rails ruby web-scraping nokogiri

I need to scrape a website that allows robots. Here is the content of its robots.txt file:

User-agent: *
Disallow:
Sitemap:https://www.sample.com/sitemap-index.xml
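
For reference, an empty `Disallow:` under `User-agent: *` means no path is off limits to crawlers. The sketch below is a simplified illustration of how that rule reads, not a full robots.txt parser: the helper names `disallowed_prefixes` and `allowed?` are hypothetical, and it ignores `Allow` lines, wildcards, and multi-agent groups. Note also that robots.txt is only an advisory crawling policy; it does not control a separate bot-mitigation layer in front of the site.

```ruby
# Hypothetical helper: collect the Disallow path prefixes that apply to `agent`.
# Simplified on purpose: no Allow lines, no wildcards, no longest-match rule.
def disallowed_prefixes(robots_txt, agent = '*')
  prefixes = []
  applies = false
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" on the first colon only.
    field, value = line.sub(/#.*/, '').strip.split(':', 2)
    next if field.nil?

    field = field.strip.downcase
    value = value.to_s.strip
    case field
    when 'user-agent'
      applies = (value == '*' || value.casecmp?(agent))
    when 'disallow'
      # An empty Disallow value forbids nothing, so it is skipped.
      prefixes << value if applies && !value.empty?
    end
  end
  prefixes
end

# Hypothetical helper: a path is allowed if no Disallow prefix matches it.
def allowed?(robots_txt, path, agent = '*')
  disallowed_prefixes(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
end

ROBOTS = <<~TXT
  User-agent: *
  Disallow:
  Sitemap:https://www.sample.com/sitemap-index.xml
TXT

puts allowed?(ROBOTS, '/search?q=test') # prints "true"
```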

But when I try to fetch the site's content with Nokogiri, the request is detected as a bot:

Nokogiri::HTML(open('https://www.sample.com/search?q=test', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))

Here is the output:

#(Document:0x3fda40e7cf70 {
  name = "document",
  children = [
    #(DTD:0x3fda40e9591c { name = "html" }),
    #(Element:0x3fda40e8c95c {
      name = "html",
      attributes = [ #(Attr:0x3fda4071a598 { name = "style", value = "height:100%" })],
      children = [
        #(Element:0x3fda3fefa28c {
          name = "head",
          children = [
            #(Element:0x3fda401a3088 {
              name = "meta",
              attributes = [ #(Attr:0x3fda40ebd7a0 { name = "name", value = "ROBOTS" }), #(Attr:0x3fda40ebd778 { name = "content", value = "NOINDEX, NOFOLLOW" })]
              }),
            #(Element:0x3fda4074faf4 {
              name = "meta",
              attributes = [ #(Attr:0x3fda3ff0beec { name = "name", value = "format-detection" }), #(Attr:0x3fda3ff0bed8 { name = "content", value = "telephone=no" })]
              }),
            #(Element:0x3fda401ca700 {
              name = "meta",
              attributes = [ #(Attr:0x3fda401c2050 { name = "name", value = "viewport" }), #(Attr:0x3fda401c217c { name = "content", value = "initial-scale=1.0" })]
              }),
            #(Element:0x3fda4079a284 {
              name = "meta",
              attributes = [ #(Attr:0x3fda4078bfb8 { name = "http-equiv", value = "X-UA-Compatible" }), #(Attr:0x3fda4078bf04 { name = "content", value = "IE=edge,chrome=1" })]
              })]
          }),
        #(Element:0x3fda407e2e6c {
          name = "body",
          attributes = [ #(Attr:0x3fda430205f0 { name = "style", value = "margin:0px;height:100%" })],
          children = [
            #(Element:0x3fda4072e2a0 {
              name = "iframe",
              attributes = [
                #(Attr:0x3fda3ff45214 {
                  name = "src",
                  value = "/_Incapsula_Resource?SWUDNSAI=28&xinfo=5-66719320-0%200NNN%20RT%281543054979096%20247%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U2&incident_id=245000650118470008-256430953704260629&edet=12&cinfo=04000000"
                  }),
                #(Attr:0x3fda3ff451d8 { name = "frameborder", value = "0" }),
                #(Attr:0x3fda3ff451b0 { name = "width", value = "100%" }),
                #(Attr:0x3fda3ff45188 { name = "height", value = "100%" }),
                #(Attr:0x3fda3ff45174 { name = "marginheight", value = "0px" }),
                #(Attr:0x3fda3ff4514c { name = "marginwidth", value = "0px" })],
              children = [ #(Text "Request unsuccessful. Incapsula incident ID: 245000650118470008-256430953704260629")]
              })]
          })]
      })]
  })
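
The output above is not the search page but a block page: its body is a single iframe pointing at `/_Incapsula_Resource`, with the text "Request unsuccessful. Incapsula incident ID: ...". A minimal sketch for recognizing that case on the raw response body before parsing it follows. The helper names `incapsula_block?` and `incapsula_incident_id` are hypothetical, and the two markers are taken only from this one output, so other block pages may look different.

```ruby
# Hypothetical helper: true if the HTML contains the iframe resource path
# seen in the block page above.
def incapsula_block?(html)
  html.include?('_Incapsula_Resource')
end

# Hypothetical helper: extract the incident ID from the block page text,
# or return nil if the text is not present.
def incapsula_incident_id(html)
  html[/Incapsula incident ID: ([\d-]+)/, 1]
end

# Shortened stand-in for the block page shown above.
sample = '<html><body><iframe src="/_Incapsula_Resource?incident_id=' \
         '245000650118470008-256430953704260629">Request unsuccessful. ' \
         'Incapsula incident ID: 245000650118470008-256430953704260629' \
         '</iframe></body></html>'

if incapsula_block?(sample)
  puts "Blocked, incident ID: #{incapsula_incident_id(sample)}"
end
```

Checking the body this way lets a scraper fail loudly (and log the incident ID) instead of silently parsing the block page as if it were search results.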

How can I scrape this site?

0 answers:

No answers yet