I am trying to scrape a website, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term like, say, "Andrew", the results are paginated, the request type is POST so the URL does not change, and the session times out very quickly. So quickly that if I wait ten minutes and refresh the search URL the page gives me a timeout error.
I started scraping only recently, so most of what I have done were GET requests where I could decipher the URL. I have now realized that I will have to look at the DOM. Using the Chrome developer tools I found the headers, and in the Network tab I also found the following form data that is passed from the search page to the results page:
__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg... (I have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (I have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search
All the fields in caps are hidden fields. I have also managed to figure out the structure of the results.
So far my script is very pathetic because I am completely blank on what to do next. I still have to submit the form, handle the pagination and scrape the results, but I have no idea how to proceed.
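For reference, the usual way to replay an ASP.NET postback like the one above with a plain HTTP client is to GET the form page first, copy the hidden __VIEWSTATE / __VIEWSTATEGENERATOR / __EVENTVALIDATION values out of the HTML into the POST payload, and reuse the same session. Below is only a rough sketch: the endpoint URL and field values are taken from the data above, and it assumes the form can be reached without Javascript, which (as the answer below points out) may not hold on this site.

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"  # assumed to also serve the form

session = requests.Session()
page = session.get(SEARCH_URL)
soup = BeautifulSoup(page.text, "html.parser")

# copy every hidden input (__VIEWSTATE, __EVENTVALIDATION, ...) into the payload
payload = {inp["name"]: inp.get("value", "")
           for inp in soup.find_all("input", type="hidden") if inp.get("name")}

# add the visible search fields listed in the question
payload["ctl00$ContentPlaceHolder1$txtName"] = "andrew"
payload["ctl00$ContentPlaceHolder1$chkIgnorePartyType"] = "on"
payload["ctl00$ContentPlaceHolder1$cmdSearch"] = "Search"

results = session.post(SEARCH_URL, data=payload)
print(results.status_code)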
import re
import urlparse
import mechanize
from bs4 import BeautifulSoup
class DocumentFinderScraper(object):
    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    def scrape(self):
        ## TO DO
        ## submit form
        # get return URL
        # scrape results
        # analyze pagination
        pass

if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()
Any help would be sincerely appreciated.
Answer 0 (score: 2)
I disabled Javascript and visited https://www.searchiqs.com/nybro/, and in the form both the Log In and Log In as Guest buttons were disabled (screenshot not reproduced here). That makes Mechanize unusable here: it cannot process Javascript, so you will not be able to submit the form.
For this kind of problem you can use Selenium, which simulates a full browser; the drawback is that it is slower than Mechanize.
This code should log you in using Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

usr = ""
pwd = ""

driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title

# fill in the user id and password fields and submit with the Enter key
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
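From there, the remaining steps in the question (submit the search, scrape the results, walk the pagination) can be done with the same driver. The sketch below is only an outline: the element ids are guesses derived from the form field names shown in the question (ASP.NET typically turns a name like ctl00$ContentPlaceHolder1$txtName into the id ctl00_ContentPlaceHolder1_txtName), and the results-grid id and pager link text are hypothetical, so verify them in the Chrome developer tools first.

from bs4 import BeautifulSoup

# fill the Party 1 box and run the search (ids assumed, see note above)
driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName").send_keys("andrew")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()

# hand the rendered results page to BeautifulSoup for scraping
soup = BeautifulSoup(driver.page_source, "html.parser")
for row in soup.select("#ctl00_ContentPlaceHolder1_grdResults tr"):  # hypothetical grid id
    print([cell.get_text(strip=True) for cell in row.find_all("td")])

# pagination: the pager links trigger ASP.NET postbacks, so clicking them
# in the live browser keeps the session and the hidden fields intact
next_links = driver.find_elements_by_link_text("Next")  # hypothetical link text
if next_links:
    next_links[0].click()

Driving the Search button and the pager links through the browser, instead of replaying the POST by hand, is what sidesteps the __VIEWSTATE and session-timeout problems described in the question.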