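I want to download all the PDF documents corresponding to a list of "API #" values from http://imaging.occeweb.com/imaging/UIC1012_1075.aspx.
So far I have managed to POST a request with the "API #", but I don't know what to do next.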
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
payload = {'txtIndex7':'1','txtIndex2': API}
session = requests.Session()
res = session.post(url,headers=headers,data=payload)
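Answer 0 (score: 1)
It is a bit more complicated than that: the form also contains hidden input fields used for event validation, and they have to be sent along with your values. To do this by hand, you first GET the page, collect all the hidden values, set the value for the API number, make the POST request, and then parse the HTML response.
Fortunately, there is a tool called MechanicalSoup that fills in these hidden fields automatically when you submit the form. Here is a complete solution, including sample code for parsing the results table: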
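For reference, the manual route described above might look roughly like this, using only the standard library. This is a sketch under assumptions: `hidden_fields`, `build_payload`, and `post_search` are made-up names, and I'm assuming the hidden inputs are ordinary `<input type="hidden">` elements (on ASP.NET pages these are typically `__VIEWSTATE` and `__EVENTVALIDATION`).

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in the HTML."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


def build_payload(html, user_fields):
    """Merge the page's hidden fields with the values you want to submit."""
    payload = hidden_fields(html)
    payload.update(user_fields)
    return payload


def post_search(url, user_fields):
    """GET the page, then POST the form back with hidden fields included."""
    # A cookie-aware opener, so both requests share any session cookie
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    headers = {"User-Agent": "Mozilla/5.0"}
    with opener.open(Request(url, headers=headers)) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    data = urlencode(build_payload(html, user_fields)).encode("ascii")
    # Passing data= makes urllib send a POST request
    with opener.open(Request(url, data=data, headers=headers)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `post_search(url, {'txtIndex2': '15335187'})` would then return the HTML of the result page. The server may also expect the name/value pair of the clicked button (`Button1`) in the payload; I have not verified that here.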
import mechanicalsoup
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr'):
    print([td.get_text() for td in tr.find_all("td")])
Answer 1 (score: 0)
import mechanicalsoup
import urllib.request  # Python 3; on Python 2.7 use "import urllib" and urllib.urlopen()
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
Form = '1012'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
    user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex7"] = Form
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Download the PDF linked from the first cell of each result row
# (the first two rows of the table are skipped)
for tr in browser.get_current_page().select('table#DataGrid1 tr')[2:]:
    try:
        pdf_url = tr.select('td')[0].find('a').get('href')
    except (IndexError, AttributeError):
        # The row has no cells, or its first cell has no link
        print('Pdf not found')
    else:
        pdf_id = tr.select('td')[0].text
        response = urllib.request.urlopen(pdf_url)
        pdf_str = "C:\\Data\\" + pdf_id + ".pdf"
        with open(pdf_str, 'wb') as file:
            file.write(response.read())
        print('Pdf ' + pdf_id + ' saved')
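The download loop above passes each href straight to urlopen and builds the file path by string concatenation. If the links on the results page turn out to be relative, or an ID contains characters that are not valid in a file name, two small helpers along these lines might help. This is only a sketch: `absolute_pdf_url` and `pdf_path` are hypothetical names, and I have not checked what the hrefs on that page actually look like.

```python
import re
from pathlib import Path
from urllib.parse import urljoin

PAGE_URL = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'


def absolute_pdf_url(href, page_url=PAGE_URL):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(page_url, href)


def pdf_path(pdf_id, out_dir):
    """Build an output path, replacing characters unsafe in file names."""
    safe_id = re.sub(r'[^\w.-]+', '_', pdf_id.strip())
    return Path(out_dir) / (safe_id + '.pdf')
```

In the loop you would then call `urllib.request.urlopen(absolute_pdf_url(pdf_url))` and open `pdf_path(pdf_id, "C:\\Data")` for writing. An href that is already absolute passes through `urljoin` unchanged.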