Download all csv files from a URL

Posted: 2016-08-19 07:44:55

Tags: python-2.7 csv download beautifulsoup

I want to download all of the csv files, but I'm not sure how to do it:

from bs4 import BeautifulSoup
import requests
url = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(url)
for link in soup.findAll("a"):
    print link.get("href")

2 answers:

Answer 0 (score: 1)

You just need to filter the hrefs with the css selector `a[href$=".csv"]`, which finds hrefs ending in *.csv*, then join each one to the base URL, request it, and finally write the content:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse
from os.path import basename

base = "http://www.football-data.co.uk/"
url = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(url, "html.parser")
# Select only anchors whose href ends in .csv, then resolve each
# relative href against the base URL.
for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    # Open in binary mode and write the raw response bytes.
    with open(basename(link), "wb") as f:
        f.write(requests.get(link).content)

This will give you five files, E0.csv, E1.csv, E2.csv, E3.csv, and E4.csv, containing all the data.
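For clarity, `urljoin` is what resolves each relative href against the site's base URL. A minimal sketch (shown with Python 3's `urllib.parse`; on Python 2 the same function lives in the `urlparse` module), using a made-up relative path just to illustrate the join:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://www.football-data.co.uk/"
# Hypothetical relative href of the kind found on the page:
href = "some_dir/some_season/E0.csv"

# urljoin resolves the relative path against the base URL.
full = urljoin(base, href)
print(full)  # http://www.football-data.co.uk/some_dir/some_season/E0.csv
```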

Answer 1 (score: 0)

Something like this should work:

from bs4 import BeautifulSoup
from time import sleep
import requests


if __name__ == '__main__':
    url = requests.get('http://www.football-data.co.uk/englandm.php').text
    soup = BeautifulSoup(url, "html.parser")
    for link in soup.findAll("a"):
        current_link = link.get("href")
        # Skip anchors without an href, and keep only csv links.
        if current_link and current_link.endswith('.csv'):
            print('Found CSV: ' + current_link)
            print('Downloading %s' % current_link)
            sleep(10)  # be polite to the server between downloads
            response = requests.get('http://www.football-data.co.uk/%s' % current_link, stream=True)
            # Flatten the relative path into a local filename.
            fn = current_link.split('/')[0] + '_' + current_link.split('/')[1] + '_' + current_link.split('/')[2]
            with open(fn, "wb") as handle:
                for data in response.iter_content(chunk_size=1024):
                    handle.write(data)