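Python script to download a file from a button on a website

Time: 2018-03-27 18:43:38

Tags: python html

I would like to download an xls file by clicking the "Export to Excel" button at the following URL: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD

More specifically, the button with name="ctl00$MainContent$btndata". I can already do this with selenium, but I plan to build a docker image around this script and run it as a docker container. The xls file is updated regularly and I need the most recent data on my local machine, so it does not make sense to open a browser that often just to fetch it. I understand there are headless versions of Chrome and Firefox, although I do not believe they support downloads. I also understand that wget will not work here, because the button is not a static link to the resource. Maybe there is a completely different approach for downloading and updating this data on my computer?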

import urllib
import requests
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
    'Origin': 'https://www.tampagov.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-US,en;q=0.5',
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Host', 'apps,tampagov.net'),
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))



payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)

2 answers:

Answer 0: (score: 0)

Work out which URL you need to fetch, as @Sphinx explained, and then emulate the request with something like the following:

import urllib.request
import urllib.parse

data = urllib.parse.urlencode({...})
data = data.encode('ascii')

with urllib.request.urlopen("http://...", data) as fd:
    print(fd.read().decode('utf-8'))

See the urllib documentation.
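As a rough illustration of that pattern, the sketch below builds the POST request without sending it, so the encoded body can be inspected first. The example.com URL and the `__VIEWSTATE` value are made up for the illustration; only the button name comes from the question.

```python
import urllib.parse
import urllib.request

# Illustrative form fields; the __VIEWSTATE value here is invented.
form_fields = {
    "__VIEWSTATE": "abc+/=",
    "ctl00$MainContent$btndata": "Export to Excel",
}

# urlencode percent-escapes reserved characters and joins the pairs with "&".
body = urllib.parse.urlencode(form_fields).encode("ascii")

# Supplying data= makes urllib issue a POST instead of a GET.
# The URL is a hypothetical placeholder; nothing is sent until urlopen() is called.
request = urllib.request.Request("http://example.com/Default.aspx", data=body)

print(request.get_method())
print(body.decode("ascii"))
```

Passing `request` to `urllib.request.urlopen` would then perform the actual POST.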

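Answer 1: (score: 0)

First: I removed import urllib, because requests is enough.

Some of the problems you ran into:

  1. You don't need to build a nested tuple and then apply urllib.urlencode. Use a dictionary and pass it via data= instead; this is one reason requests is so popular.

  2. You should fill in all the parameters of the HTTP POST request, as shown below. Otherwise the backend may reject the request.

  3. I added some simple code to save the content locally.

  4. PS: for those form parameters, you can get their values by parsing the HTML returned by the HTTP GET. You can also customize the parameters as needed, for example the page size.

    Here is a working sample: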

    import requests
    from bs4 import BeautifulSoup
    from tqdm import tqdm
    
    def downloadExcel():
        headers = {
            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Origin': 'https://www.tampagov.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
            'Accept-Encoding': 'gzip,deflate,br',
            'Accept-Language': 'en-US,en;q=0.5',
        }
    
        r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
        # bail out early if the initial GET failed
        if r.status_code != 200:
            print('Error: initial GET returned', r.status_code)
            return
        # parse and retrieve two vital form values
        soup = BeautifulSoup(r.content, "html.parser")
        viewstate = soup.select("#__VIEWSTATE")[0]['value']
        eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
        print ('__VIEWSTATE:', viewstate)
        print ('__EVENTVALIDATION:', eventvalidation)
        formData = {
            '__EVENTVALIDATION': eventvalidation,
            '__VIEWSTATE': viewstate,
            '__EVENTTARGET': '',
            '__EVENTARGUMENT': '',
            '__VIEWSTATEGENERATOR': '49DF2C80',
            'MainContent_RadScriptManager1_TSM':""";;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
            'search_block_form':'',
            'ctl00$MainContent$btndata':'Export to Excel',
            'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState':'',
            'ctl00_MainContent_RadWindow1_ClientState':'',
            'ctl00_MainContent_RadWindowManager1_ClientState':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox':'20',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput':'',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState':'{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState':'{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address':'',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description':'',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number':'',
            'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount':'',
            'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox':'20',
            'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState':'',
            'ctl00_MainContent_RadGrid1_rfltMenu_ClientState':'',
            'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState':'',
            'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD':'[]',
            'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD':'[[1900,1,1],[2099,12,31],[2018,3,29]]',
            'ctl00_MainContent_RadGrid1_ClientState':'',
            }
    
        # second HTTP request with form data
        r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
        print('received:', r.status_code, len(r.content))
        with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:
            # iter_content() defaults to 1-byte chunks; use larger chunks to keep the loop fast
            for data in tqdm(r.iter_content(chunk_size=8192)):
                handle.write(data)
    
    downloadExcel()
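The sample above hard-codes several values, such as `__VIEWSTATEGENERATOR`, that may change when the page is redeployed. Point 4 suggests reading them from the GET response instead, and one possible way to do that is sketched below. It uses the standard-library `html.parser` rather than BeautifulSoup only to stay dependency-free. The `HiddenFieldCollector` class, the `collect_hidden_fields` helper, and the HTML fragment are all hypothetical, not taken from the real page.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def collect_hidden_fields(html):
    """Return a dict of all hidden form fields found in the given HTML text."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


# Made-up page fragment for illustration only.
sample_html = """
<form method="post" action="./Default.aspx?type=TPD">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="49DF2C80" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="" />
  <input type="submit" name="ctl00$MainContent$btndata" value="Export to Excel" />
</form>
"""

form_data = collect_hidden_fields(sample_html)
# The clicked button is not a hidden field, so it is added by hand.
form_data["ctl00$MainContent$btndata"] = "Export to Excel"
print(sorted(form_data))
```

In a real run, `r.text` from the initial GET would replace `sample_html`, and the resulting dictionary could be passed as `data=` to `requests.post`. Whether the server accepts a request carrying only the hidden fields plus the button is an assumption that would need checking against the live page.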