我是机械化的新手,正在尝试克服这个可能非常明显的答案。
我整理了一个简短的脚本来在外部站点上进行身份验证,然后单击一个可动态生成CSV文件的链接。
我终于可以单击导出按钮,但是它返回一个AWS URL。
我正在尝试从该JSON响应(如下所示)中下载脚本以下载所述CSV。
Myscript.rb
require 'mechanize'
require 'logger'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'zlib'
USERNAME = "myemail"
PASSWORD = "mysecret"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
mechanize = Mechanize.new do |a|
a.user_agent = USER_AGENT
end
form_page = mechanize.get('https://XXXX.XXXXX.com/signin')
form = form_page.form_with(:id =>'login')
form.field_with(:id => 'user_email').value=USERNAME
form.field_with(:id => 'user_password').value=PASSWORD
page = form.click_button
donations = mechanize.get('https://XXXXX.XXXXXX.com/pages/ACCOUNT/statistics')
puts donations.body
donations = mechanize.get('https://xxx.siteimscraping.com/pages/myaccount/statistics')
bs_csv_download = page.link_with(:text => 'Download CSV')
来自网站的JSON响应,包含指向CSV的链接,我需要通过Mechanize和/或nokogiri进行解析和下载。
{"message":"Find your report at https://s3.amazonaws.com/reports.XXXXXXX.com/XXXXXXX.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=AKIAIKW4BJKQUNOJ6D2A%2F20190228%2Fus-east-1%2Fs3%2Faws4_request\u0026X-Amz-Date=20190228T025844Z\u0026X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=b19b6f1d5120398c850fc03c474889570820d33f5ede5ff3446b7b8ecbaf706e"}
非常感谢您的帮助。
答案 0 :(得分:0)
您可以将其解析为JSON,然后从响应中检索一个子字符串(假设它始终以相同的格式响应):
require 'json'
...
bs_csv_download = page.link_with(:text => 'Download CSV')
json_response = JSON.parse(bs_csv_download)
direct_link = json_response["message"][20..-1]
mechanize.get(direct_link).save('file.csv')
我们通过[20..-1]
在“消息”值中获得第20个字符(-1表示直到字符串末尾)。