Question

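I am trying to write some code that downloads the two most recent Outage Weeks publications found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/

They are xlsx files, which I will load into Excel afterwards. The programming language the code is written in does not matter.

My first idea was to use the direct URLs, such as http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx , and then write some code that guesses the URLs of the two latest publications. But I noticed some inconsistencies in how the URLs are named, so that approach will not work reliably.

Instead, scraping the website and using XPath to locate and download the files might be a solution. I found that the two latest publications always have the following XPaths: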

/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a

This is where I need help. I am new to XPath and web scraping. I tried something like this in Python:

from lxml import html
import requests

page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)

v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')

But v seems to be empty.

Any ideas would be greatly appreciated!

Answer 1

Just use contains to find the hrefs and slice off the first two:

tree.xpath('//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")]/@href')[:2]

Or use [position() < 3] to do everything in XPath:

tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')

The files are listed from newest to oldest, so taking the first two gives you the two most recent ones.
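To see that the two variants select the same links, here is a small offline sketch. The HTML fragment is a made-up stand-in for the real page (the third file name and the "About" link are invented for illustration), so it only shows how the expressions behave, not what the live site currently returns.

```python
from lxml import html

# Hypothetical stand-in for the real page: three matching links, newest first,
# plus one unrelated link that the contains() filter should skip.
SAMPLE = """
<html><body><div>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx">Weeks 36-51</a></p>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_35(2016)-50(2016).xlsx">Weeks 35-50</a></p>
  <p><a href="/site-files/library/EirGrid/Outage-Weeks_34(2016)-49(2016).xlsx">Weeks 34-49</a></p>
  <p><a href="/about/">About</a></p>
</div></body></html>
"""

tree = html.fromstring(SAMPLE)
PREFIX = "/site-files/library/EirGrid/Outage-Weeks"

# Variant 1: select every matching href, then slice the list in Python.
sliced = tree.xpath('//p/a[contains(@href, "{}")]/@href'.format(PREFIX))[:2]

# Variant 2: wrap the node set in parentheses and keep positions 1 and 2 in XPath.
positioned = tree.xpath(
    '(//p/a[contains(@href, "{}")])[position() < 3]/@href'.format(PREFIX)
)

print(sliced)
print(positioned)
```

The parentheses in the second variant matter: without them, position() would be evaluated relative to each p parent rather than over the whole set of matching links.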

To download the files, you need to join each href to the base URL and write the response content to a file:

from lxml import html
import requests
import os
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)

# hrefs of the two newest Outage-Weeks files (the page lists newest first)
v = tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
for href in v:
    # os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
    with open(os.path.basename(href), "wb") as f:
        # join the relative href to the site root and save the response body
        f.write(requests.get(urljoin("http://www.eirgridgroup.com", href)).content)
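One detail worth noting: some hrefs contain percent-escapes (the example URL in the question has `%20`), and os.path.basename keeps them in the local file name. A minimal sketch of one way to work out the absolute URL and a decoded file name before downloading is shown below. The helper name plan_downloads and the BASE_URL constant are made up for illustration, and the sketch assumes Python 3.

```python
import posixpath
from urllib.parse import urljoin, unquote, urlsplit

# Assumed site root, taken from the URLs quoted in the question.
BASE_URL = "http://www.eirgridgroup.com"


def plan_downloads(hrefs, base_url=BASE_URL):
    """Return (absolute_url, local_filename) pairs for the given hrefs.

    Hypothetical helper: it performs no network access, it only
    works out where each file lives and what to call it locally.
    """
    plan = []
    for href in hrefs:
        url = urljoin(base_url, href)
        # Take the last path segment and decode %XX escapes such as %20.
        filename = unquote(posixpath.basename(urlsplit(url).path))
        plan.append((url, filename))
    return plan


hrefs = ["/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx"]
for url, filename in plan_downloads(hrefs):
    print(url, "->", filename)
```

Each pair can then be passed to requests.get(url) and open(filename, "wb") exactly as in the loop above.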
