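I'm new to programming, and I'm trying to write code that scrapes a specific website. The problem is that the site puts a page in front of its content where you have to accept the terms of use and the privacy policy. You can see it here: http://cpdocket.cp.cuyahogacounty.us/
I need to get past this page somehow, and I don't know how. I'm writing my code in Java, and so far I have working code that scrapes the source of any website. Here it is: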
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Scraper takes a URL string and returns the source code of that page.
public class Scraper {
    private final String url; // the website to be scraped (per instance, not static)

    // constructor
    public Scraper(String url) {
        this.url = url;
    }

    // scrapeWebsite fetches the page at url and returns its source as a string.
    // Ideally this string is saved so that another method can parse it.
    public String scrapeWebsite() throws IOException {
        URLConnection connection = new URL(url).openConnection(); // connects to the url
        StringBuilder source = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine;
            // append each line until the stream is exhausted
            while ((inputLine = in.readLine()) != null) {
                source.append(inputLine).append('\n'); // keep the line breaks
            }
        }
        return source.toString();
    }
}
Any advice on how to do this would be greatly appreciated.
I'm rewriting this based on some Ruby code. The code is:
def initializeSession()
  ## SETUP # POST headers
  post_header = Hash.new()
  post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
  post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
  post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  post_header['Accept-Language'] = 'en-US,en;q=0.5'
  post_header['Accept-Encoding'] = 'gzip, deflate'
  post_header['X-Requested-With'] = 'XMLHttpRequest'
  post_header['X-MicrosoftAjax'] = 'Delta=true'
  post_header['Cache-Control'] = 'no-cache'
  post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
  post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
  # post_header['Content-Length'] = '12197'
  post_header['Connection'] = 'keep-alive'
  post_header['Pragma'] = 'no-cache'

  # STEP # set up simulated browser and make first request
  #browser = SimBrowser.new()
  #logname = 'log.txt'
  #s = Scribe.new(logname)
  session_cookie = 'ASP.NET_SessionId'
  url = 'http://cpdocket.cp.cuyahogacounty.us/'
  @browser.http_get(url)
  #puts browser.get_body() # debug
  puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie)
  @log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code)
  # s.flog('### HOME PAGE RESPONSE')
  # s.flog(browser.get_body()) # debug

  # STEP # send our acceptance of the terms of service
  data = {
    'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
    '__EVENTARGUMENT'=>'',
    '__EVENTTARGET'=>'',
    '__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
    '__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
  }
  #post_header['Referer'] = url
  @browser.http_post(url, data, post_header)
  @log.slog('DEBUG: accept terms response code: expected 200, actual ' + @browser.get_response().code)
  @log.flog('### TOS ACCEPTANCE RESPONSE')
  # @log.flog(@browser.get_body()) # debug
end
Can this be done in Java?
Answer 0 (score: 0)
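If you don't understand how it works, the best way to learn is to go through the page manually while watching the network traffic in Firebug (on Firefox) or the equivalent developer tools in IE, Chrome, or Safari.
Whatever the browser sends when a user accepts the terms and conditions manually, your code has to replicate.
You should also be aware that the UI shown to the user may not be sent directly as HTML; it may be built dynamically by JavaScript that normally runs in the browser. If that is the case and you are not prepared to emulate a browser fully, maintaining the DOM and executing the JavaScript, then this may not be possible.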
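As a rough illustration of replicating the acceptance step in Java, here is a minimal sketch of the same flow as the Ruby code in the question: a GET that picks up the session cookie, then a form POST carrying the "Yes" button field and the hidden ASP.NET fields. It assumes Java 11 or later (for `java.net.http.HttpClient`). The class name `TosSession` and the helpers `newClient`, `hiddenField`, `formEncode`, and `acceptTerms` are made up for this example. It also assumes the hidden `__VIEWSTATE` and `__EVENTVALIDATION` inputs are present in the HTML the server returns and that each input lists `name` before `value`; neither has been verified against the live site, so check the actual traffic in your browser's developer tools first.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class for this example; not part of any library.
class TosSession {

    // A client with a CookieManager keeps the session cookie between requests.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // Returns the value of a hidden <input> such as __VIEWSTATE, or "" if absent.
    // Assumption: the name attribute comes before value inside the tag.
    // A real HTML parser would be more robust than this regex.
    static String hiddenField(String html, String name) {
        Pattern p = Pattern.compile(
                "<input[^>]*name=\"" + Pattern.quote(name) + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Builds an application/x-www-form-urlencoded request body.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // GETs the terms page, then POSTs the acceptance form back to the same URL.
    static HttpResponse<String> acceptTerms(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpResponse<String> home = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        String viewState = hiddenField(home.body(), "__VIEWSTATE");
        if (viewState.isEmpty()) {
            // The form may be built by JavaScript, in which case plain HTTP is not enough.
            throw new IllegalStateException("__VIEWSTATE not found in the returned HTML");
        }

        // Field names are taken from the Ruby code in the question.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("ctl00$SheetContentPlaceHolder$btnYes", "Yes");
        form.put("__EVENTTARGET", "");
        form.put("__EVENTARGUMENT", "");
        form.put("__VIEWSTATE", viewState);
        form.put("__EVENTVALIDATION", hiddenField(home.body(), "__EVENTVALIDATION"));

        HttpRequest post = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("Referer", url)
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString());
    }
}
```

You would call `TosSession.acceptTerms(client, "http://cpdocket.cp.cuyahogacounty.us/")` with a client from `TosSession.newClient()` and then reuse that same client for the later requests, so the session cookie is sent along. The sketch reads the hidden fields from the first response instead of hard-coding them as the Ruby code does, because those values are generated by the server and may change.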