Download CSVs from a list of CSV links to the desktop

Date: 2019-05-30 10:34:14

Tags: python-3.x csv web-scraping python-requests

Question:

I don't know whether my Google-fu has failed me again, but I am unable to download the CSVs from a list of URLs. I have used requests and bs4 to gather the URLs (the final list is correct) - see the process outlined below for more information.

I then use urllib to do the downloading, following one of the answers given here: Trying to download data from URL with CSV File, as well as a number of other Stack Overflow Python answers for downloading CSVs.

Currently I am stuck with:

HTTP Error 404: Not Found

(The stack trace below is from the final attempt, where a User-Agent is passed)

----> 9 f = urllib.request.urlopen(req)
     10 print(f.read().decode('utf-8'))
     #other lines

--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

I tried the solution given here of adding a User-Agent: Web Scraping using Python giving HTTP Error 404: Not Found, although I would have expected to see a 403 error code rather than a 404 - it did, however, seem to work for a number of OPs.

This still fails with the same error. I am fairly sure I could get around this by simply using selenium and passing the CSV URLs to .get, but I want to know whether I can solve this using requests alone.
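A quick way to narrow this down would presumably be to check whether the URL actually being requested resolves at all, since a 404 usually points at the URL itself rather than at missing headers. A minimal check might look like this (a sketch only; the URL below is just one CSV href copied from the scraped pages, so substitute whatever value the download loop really passes to urlopen):

import requests

test_url = 'https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv'  # example href, not the loop variable
r = requests.head(test_url, allow_redirects=True)
print(r.status_code, r.url)  # a 200 here would suggest the URL built in the download loop is what is wrong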


Overview:

I visit this page:

https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice

I grab all of the monthly version links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all of the CSV links within them.

I loop over the final dictionary of filename:download_url pairs and attempt to download the files.


Question:

Can anyone see what I am doing wrong, or how to fix this so that I can download the files without resorting to selenium? I am also unsure of the most efficient way to accomplish this - perhaps urllib is not actually required at all and requests alone is sufficient?


Python:

Without a User-Agent:

import requests
from bs4 import BeautifulSoup as bs
import urllib

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict


path = r'C:\Users\User\Desktop'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()

    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break  #as only need one test case

Adding a User-Agent for testing:

req = urllib.request.Request(
    v, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
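
For reference, the requests-only equivalent of the urllib attempt above would look something like this (a sketch, reusing the loop variables k and v and the path from the first snippet; raise_for_status simply surfaces the 404 explicitly instead of writing an error page to disk):

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
response = requests.get(v, headers=headers)
response.raise_for_status()  # raises HTTPError on a 404 rather than failing silently

with open(path + '\\' + k + '.csv', 'wb') as f:
    f.write(response.content)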

1 answer:

Answer 0 (score: 1):

Looking at the values shows me that your links look like this:

https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv

I think you want to remove base +, so use this:

file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}

instead of:

file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
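
Alternatively, urllib.parse.urljoin handles both cases - it leaves absolute hrefs such as the files.digital.nhs.uk ones untouched and only resolves relative ones against base - so a sketch along these lines would also work:

from urllib.parse import urljoin

# urljoin keeps absolute hrefs as-is and only prepends base for relative ones
file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href']) for item in soup.select('[href$=".csv"]')}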

EDIT: Full code:

import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict

path = 'C:/Users/User/Desktop/'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = requests.get(v)
    html = response.content

    k = k.replace(':', ' -')
    file = path + k + '.csv'

    with open(file, 'wb' ) as f:
        f.write(html)
    break  #as only need one test case
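
The k.replace(':', ' -') step is there because the link text contains colons, which are not valid in Windows file names. If other link texts turn out to contain further invalid characters, a broader sanitisation along these lines could be swapped in (a sketch, not part of the original answer):

import re

def safe_filename(name):
    # replace characters that are invalid in Windows file names
    return re.sub(r'[<>:"/\\|?*]', ' -', name).strip()

# in the loop above: file = path + safe_filename(k) + '.csv'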