为什么我在搜索表单中搜索“”空格时没有在[this] [1]页面上获取产品我只获取菜单而不是搜索结果产品
Ruby代码:
require 'nokogiri'
require 'mysql2'
require 'logger'
require 'mechanize'
agent = Mechanize.new{|a| a.log = Logger.new(STDERR) }
agent.user_agent_alias = 'Windows Mozilla'
agent.read_timeout = 60
def add_cookie(agent, uri, cookie)
uri = URI.parse(uri)
Mechanize::Cookie.parse(uri, cookie) do |cookie|
agent.cookie_jar.add(uri, cookie)
end
end
login_page = agent.get "http://www.example.com.mx/login.php?location=%2F"
login_form = login_page.form_with(:method => 'POST')
email_field = login_form.field_with(name: "correo_ingresar")
password_field = login_form.field_with(name: "password")
email_field.value = 'user@example.com'
password_field.value = 'password'
home_page = login_form.submit
myarray = home_page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
myarray.each{|line| add_cookie agent, 'http://www.example.com.mx', "#{line[0]}=#{line[1]}"}
add_cookie(agent, 'http://www.example.com.mx', "forzar_existencias=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "articulos_mostrar=50; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "forz_existencias=1=; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "no_actualiza=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "orden_mostrar=8; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "page=1; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "precio_inicio=0; path=/; domain=www.example.com.mx")
add_cookie(agent, 'http://www.example.com.mx', "location=%2Farticulos.php%3Fbuscar%3D%2B; path=/; domain=www.example.com.mx")
search_form = home_page.forms.first
search_field = search_form.field_with(name: "buscar")
search_field.value = ' '
search_results = search_form.submit
resultados = 'http://example.com.mx/articulos.php?buscar=+'
我使用firebug为firefox下载了Live HTTP Headers插件。当我填写空格并单击[网页] [1]上的搜索按钮时,我会在实时HTTP标题上获得以下结果。
http://example.com.mx/articulos.php?buscar=+
GET /articulos.php?buscar=+ HTTP/1.1
Host: example.com.mx
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://example.com.mx/articulos.php?buscar=+
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1
Connection: keep-alive
HTTP/1.1 200 OK
Date: Sat, 08 Aug 2015 04:29:40 GMT
Server: Apache
x-powered-by: PHP/5.4.30
Cache-Control: no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: 0
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
----------------------------------------------------------
http://www.google-analytics.com/collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832
GET /collect?v=1&_v=j37&a=1988602157&t=pageview&_s=1&dl=http%3A%2F%2Fexample.com.mx%2Farticulos.php%3Fbuscar%3D%2B&ul=en-us&de=UTF-8&dt=Sistemas%20Aplicados&sd=24-bit&sr=1920x1080&vp=1903x969&je=0&_u=AACAAEABI~&jid=&cid=162897808.1438611502&tid=UA-58813310-1&z=90642832 HTTP/1.1
Host: www.google-analytics.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://example.com.mx/articulos.php?buscar=+
Connection: keep-alive
HTTP/1.1 200 OK
Pragma: no-cache
Expires: Mon, 07 Aug 1995 23:30:00 GMT
Access-Control-Allow-Origin: *
Last-Modified: Sun, 17 May 1998 03:00:00 GMT
x-content-type-options: nosniff
Content-Type: image/gif
Date: Wed, 29 Jul 2015 12:33:33 GMT
Server: Golfe2
Content-Length: 35
Age: 834969
Alternate-Protocol: 80:quic,p=0
Cache-Control: private, no-cache, no-cache=Set-Cookie, proxy-revalidate
----------------------------------------------------------
http://example.com.mx/resultados.php
POST /resultados.php HTTP/1.1
Host: example.com.mx
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Referer: http://example.com.mx/articulos.php?buscar=+
Content-Length: 204
Cookie: _ga=GA1.3.162897808.1438611502; _gat=1
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache
opcion=&buscar=+&page=1&articulos_mostrar=10&orden_mostrar=1&seccion=&linea=&sublinea=&forz_existencias=1&precio_inicio=0&precio_final=20000&location=%252Farticulos.php%253Fbuscar%253D%252B&no_actualiza=1
HTTP/1.1 200 OK
Date: Sat, 08 Aug 2015 04:29:42 GMT
Server: Apache
x-powered-by: PHP/5.4.30
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
----------------------------------------------------------
问题是:如何在网页上显示完整的产品,以便我可以开始抓取,如果它有引用链接并且它不会自动获取产品。 [This] [2]是生成的HTML:
答案 0 :(得分:0)
我提供了2个解决方案,但只有一个使用了您在问题中提出的POST:
require 'mechanize'
agent = Mechanize.new
agent.get("http://www.sistemasaplicados.com.mx/")
agent.page.forms.first.field_with(name: "buscar").value = ' '
result_page = agent.page.forms.first.submit
另一个选项是对您的搜索词进行编码,并使用nokogiri直接在简单的GET请求(在URL中编码)中使用它。在您的特定情况下,搜索“160GB”会产生以下网址http://www.sistemasaplicados.com.mx/articulos.php?buscar=160GB
,您只能GET
。
<强>更新强> 当手动检查发生了什么时,您应该看看如果禁用JavaScript会发生什么(在这种情况下,没有结果)。然后,使用浏览器的“检查器”,“控制台”或“开发人员工具”(通常按F12打开),找出发生的情况。在您的情况下,对resultados.php的POST请求已完成。我发现了firefox,dev工具,“网络”选项卡。您还可以在POST请求中找到要启动的相关参数。