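I would like to do the following automatically:
Click the link at the bottom of the page that ends with the current year and month (e.g. http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract-Plan-State-County-Items/Monthly-Enrollment-by-CPSC-2016-04.html)
On the next page, download the zip file from the top link under "Downloads": Monthly Enrollment by CPSC - April 2016 [ZIP, 20MB]
So far I have the following to get the current year and month, but I need help with the rest...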
from datetime import datetime
import calendar
Day = datetime.now().day
Month = datetime.now().month
Year = datetime.now().year
m=calendar.month_name[Month]
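Answer 0 (score: 2)
You need an XML parser to extract the link from the XML feed, and an HTML parser to extract the link to the zip file. For this we'll use lxml.etree and lxml.html, respectively. A working implementation: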
from datetime import datetime
from urllib.request import urlretrieve
from urllib.parse import urljoin
import requests
from lxml import etree
from lxml import html
date_part = datetime.now().strftime("%Y-%m")
with requests.Session() as session:
    # get the XML feed and extract the link
    response = session.get("https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Enrollment-by-Contract-Plan-State-County-DL.xml")
    root = etree.fromstring(response.content)
    link = root.xpath("//item/link[contains(., '-%s.html')]/text()" % date_part)[0]
    # follow the link and extract the link to the zip file
    response = session.get(link)
    root = html.fromstring(response.content)
    zip_link = root.xpath("//a[@type='application/zip']/@href")[0]
    link = urljoin(link, zip_link)
# download zip
urlretrieve(link, filename="my.zip")
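One caveat: the code above indexes the XPath result with [0], so it raises IndexError if the feed has no item for the current month yet (for example, early in the month). Below is a minimal sketch of one way to fall back to earlier months. The helper names recent_month_parts and pick_link are hypothetical, not part of any library, and it assumes the page URLs keep ending in "-YYYY-MM.html" as in the question.

```python
from datetime import date


def recent_month_parts(today, count=3):
    """Return 'YYYY-MM' strings for today's month and the months before it, newest first."""
    parts = []
    year, month = today.year, today.month
    for _ in range(count):
        parts.append("%04d-%02d" % (year, month))
        month -= 1
        if month == 0:
            # roll over from January to December of the previous year
            year, month = year - 1, 12
    return parts


def pick_link(links, parts):
    """Return the link ending in '-<part>.html' for the newest part that has one, else None."""
    for part in parts:
        suffix = "-%s.html" % part
        for link in links:
            if link.endswith(suffix):
                return link
    return None
```

With the code above, the candidate links could be collected with root.xpath("//item/link/text()") and passed to pick_link together with recent_month_parts(date.today()); a None result then means none of the recent months has been published.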