我有一个page有一个表(表ID =" ctl00_ContentPlaceHolder_ctl00_ctl00_GV" class =" GridListings")我需要废弃。 我经常使用BeautifulSoup& urllib,但在这种情况下问题是该表需要一些时间来加载,所以当我尝试使用BS获取它时它不会被捕获。 由于一些安装问题我不能使用PyQt4,drysracpe或windmill,所以唯一可行的方法是使用Selenium / PhantomJS 我尝试了以下方法,仍然没有成功:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get(url)
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located(By.CSS_SELECTOR, 'table#ctl00_ContentPlaceHolder_ctl00_ctl00_GV'))
上面的代码没有给我所需的表格内容。 我如何实现这一目标???
答案 0 :(得分:2)
你可以使用 requests 和 bs4,获取数据,几乎所有的asp网站都有一些总是需要像 __ EVENTTARGET , __ EVENTVALIDATION 等..:
from bs4 import BeautifulSoup
import requests
data = {"__EVENTTARGET": "ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV",
"__EVENTARGUMENT": "LISTINGS;0",
"ctl00$ContentPlaceHolder$ctl00$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$hdnProductID": "139",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortField": "Listing Number",
"ctl00$ContentPlaceHolder$ctl00$ctl00$drpSortDirection": "A-Z, Low-High",
"__ASYNCPOST": "true"}
对于实际帖子,我们需要为帖子数据添加更多值:
post = "https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
with requests.Session() as s:
s.headers.update({"User-Agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
soup = BeautifulSoup(s.get(post).content)
data["__VIEWSTATEGENERATOR"] = soup.select_one("#__VIEWSTATEGENERATOR")["value"]
data["__EVENTVALIDATION"] = soup.select_one("#__EVENTVALIDATION")["value"]
data["__VIEWSTATE"] = soup.select_one("#__VIEWSTATE")["value"]
r = s.post(post, data=data)
soup2 = BeautifulSoup(r.content)
table = soup2.select_one("div.GridListings")
print(table)
运行代码时,您将看到打印的表格。
答案 1 :(得分:0)
如果您想废弃某些内容,最好首先安装一个网络调试器(例如Firebug Mozilla Firefox)来观看您要废弃的网站是如何工作的。
接下来,您需要复制网站连接到后台的过程
正如您所说,您要废弃的内容是异步加载的(仅当文档准备就绪时)
假设调试器正在运行并且您已刷新页面,您将在网络选项卡上看到以下请求:
POST https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx
实现目标的最终流程将是:
请参阅下面的工作代码:
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup
import requests
base_url="https://seahawks.strmarketplace.com/Charter-Seat-Licenses/Charter-Seat-Licenses.aspx"
#create requests session
s = requests.session()
#get index page
r=s.get(base_url)
#soup page
bs=BeautifulSoup(r.text)
#extract FORM html
form_soup= bs.find('form',{'name':'aspnetForm'})
#extracting all inputs
input_div = form_soup.findAll("input")
#build the data parameters for POST request
#we add some required <fixed> data parameters for post
data={
'__EVENTARGUMENT':'LISTINGS;0',
'__EVENTTARGET':'ctl00$ContentPlaceHolder$ctl00$ctl00$RadAjaxPanel_GV',
'__EVENTVALIDATION':'/wEWGwKis6fzCQLDnJnSDwLq4+CbDwK9jryHBQLrmcucCgL56enHAwLRrPHhCgKDk6P+CwL1/aWtDQLm0q+gCALRvI2QDAKch7HjBAKWqJHWBAKil5XsDQK58IbPAwLO3dKwCwL6uJOtBgLYnd3qBgKyp7zmBAKQyTBQK9qYAXAoieq54JAuG/rDkC1djKyQMC1qnUtgoC0OjaygUCv4b7sAhfkEODRvsa3noPfz2kMsxhAwlX3Q=='
}
#we add some <dynamic> data parameters
for input_d in input_div:
try:
data[ input_d['name'] ] =input_d['value']
except:
pass #skip unused input field
#post request
r2=s.post(base_url,data=data)
#write the result
with open("post_result.html","w") as f:
f.write(r2.text.encode('utf8'))
现在,请查看&#34; post_result.html&#34;内容,你会发现数据!
此致