我需要对允许机器人访问的网站进行网页抓取。下面是robot.txt文件的内容。
User-agent: *
Disallow:
Sitemap:https://www.sample.com/sitemap-index.xml
但是当我尝试使用nokogiri
来获取网站内容时,就会被检测到。
Nokogiri::HTML(open('https://www.sample.com/search?q=test', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))
以下是输出:
> (Document:0x3fda40e7cf70 {
name = "document",
children = [
#(DTD:0x3fda40e9591c { name = "html" }),
#(Element:0x3fda40e8c95c {
name = "html",
attributes = [ #(Attr:0x3fda4071a598 { name = "style", value = "height:100%" })],
children = [
#(Element:0x3fda3fefa28c {
name = "head",
children = [
#(Element:0x3fda401a3088 {
name = "meta",
attributes = [ #(Attr:0x3fda40ebd7a0 { name = "name", value = "ROBOTS" }), #(Attr:0x3fda40ebd778 { name = "content", value = "NOINDEX, NOFOLLOW" })]
}),
#(Element:0x3fda4074faf4 {
name = "meta",
attributes = [ #(Attr:0x3fda3ff0beec { name = "name", value = "format-detection" }), #(Attr:0x3fda3ff0bed8 { name = "content", value = "telephone=no" })]
}),
#(Element:0x3fda401ca700 {
name = "meta",
attributes = [ #(Attr:0x3fda401c2050 { name = "name", value = "viewport" }), #(Attr:0x3fda401c217c { name = "content", value = "initial-scale=1.0" })]
}),
#(Element:0x3fda4079a284 {
name = "meta",
attributes = [ #(Attr:0x3fda4078bfb8 { name = "http-equiv", value = "X-UA-Compatible" }), #(Attr:0x3fda4078bf04 { name = "content", value = "IE=edge,chrome=1" })]
})]
}),
#(Element:0x3fda407e2e6c {
name = "body",
attributes = [ #(Attr:0x3fda430205f0 { name = "style", value = "margin:0px;height:100%" })],
children = [
#(Element:0x3fda4072e2a0 {
name = "iframe",
attributes = [
#(Attr:0x3fda3ff45214 {
name = "src",
value = "/_Incapsula_Resource?SWUDNSAI=28&xinfo=5-66719320-0%200NNN%20RT%281543054979096%20247%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U2&incident_id=245000650118470008-256430953704260629&edet=12&cinfo=04000000"
}),
#(Attr:0x3fda3ff451d8 { name = "frameborder", value = "0" }),
#(Attr:0x3fda3ff451b0 { name = "width", value = "100%" }),
#(Attr:0x3fda3ff45188 { name = "height", value = "100%" }),
#(Attr:0x3fda3ff45174 { name = "marginheight", value = "0px" }),
#(Attr:0x3fda3ff4514c { name = "marginwidth", value = "0px" })],
children = [ #(Text "Request unsuccessful. Incapsula incident ID: 245000650118470008-256430953704260629")]
})]
})]
})]
})
如何实现此网络抓取?