Python requests: go to a link and download

Time: 2016-04-19 14:27:42

Tags: python python-3.x web-scraping python-requests

I would like to do the following in an automated way:

  1. Go to this link: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract-Plan-State-County-DL.xml

  2. Click the link at the bottom of the page that ends with the current year and month (e.g. http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract-Plan-State-County-Items/Monthly-Enrollment-by-CPSC-2016-04.html)

  3. On the next page, download the zip file from the top link under "Downloads": CPSC Monthly Enrollment - April 2016 [ZIP, 20MB]

So far I have the following to get the current year and month, but I need help with the rest...

    from datetime import datetime
    import calendar

    # Take one timestamp so day, month, and year cannot straddle midnight
    now = datetime.now()
    Day = now.day
    Month = now.month
    Year = now.year
    m = calendar.month_name[Month]  # full month name, e.g. "April"
    

1 answer:

Answer 0 (score: 2)

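You need an XML parser to extract the link from the XML feed, and an HTML parser to extract the link to the zip file. For that, we'll use lxml.etree and lxml.html respectively. Working implementation: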

    from datetime import datetime
    from urllib.request import urlretrieve
    from urllib.parse import urljoin

    import requests
    from lxml import etree
    from lxml import html


    date_part = datetime.now().strftime("%Y-%m")
    with requests.Session() as session:
        # get the XML feed and extract the link
        response = session.get("https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract-Plan-State-County-DL.xml")
        root = etree.fromstring(response.content)
        link = root.xpath("//item/link[contains(., '-%s.html')]/text()" % date_part)[0]

        # follow the link and extract the link to the zip file
        response = session.get(link)
        root = html.fromstring(response.content)
        zip_link = root.xpath("//a[@type='application/zip']/@href")[0]
        link = urljoin(link, zip_link)

        # download zip
        urlretrieve(link, filename="my.zip")
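
One caveat with the `[0]` lookup on the feed: if no item for the current month has been published yet (an assumption about how the feed behaves, not something verified here), the XPath returns an empty list and the script stops with an IndexError. Below is a minimal sketch of a fallback that walks back month by month until a matching link is found. The names `month_suffixes` and `pick_latest_link` and the example.com URLs are hypothetical, used only for illustration.

```python
from datetime import date


def month_suffixes(today, months_back=3):
    """Yield 'YYYY-MM' strings for the current month, then earlier months."""
    year, month = today.year, today.month
    for _ in range(months_back + 1):
        yield "%04d-%02d" % (year, month)
        month -= 1
        if month == 0:
            # Roll over from January to December of the previous year
            year, month = year - 1, 12


def pick_latest_link(links, today, months_back=3):
    """Return the link for the newest month found, or None if none match."""
    cleaned = [link.strip() for link in links]
    for suffix in month_suffixes(today, months_back):
        ending = "-%s.html" % suffix
        for link in cleaned:
            if link.endswith(ending):
                return link
    return None


# Hypothetical feed contents: the April 2016 item is not there yet.
sample_links = [
    "http://example.com/Items/Monthly-Enrollment-by-CPSC-2016-03.html",
    "http://example.com/Items/Monthly-Enrollment-by-CPSC-2016-02.html",
]
chosen = pick_latest_link(sample_links, date(2016, 4, 19))
```

In the script, one could collect every feed link with `root.xpath("//item/link/text()")` and pass that list together with `datetime.now()`. A `None` result then means nothing recent was found, which is easier to report than a bare IndexError.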