Question:
Not sure whether my Google-fu is failing me again, but I can't download the csvs from my list of URLs. I've used requests and bs4 to gather the URLs (the final list is correct) - see the process below for more info.
I then used urllib to do the download, following one of the answers given here: Trying to download data from URL with CSV File, as well as a few other Stack Overflow Python answers on downloading csvs.
Currently I'm stuck on
HTTP Error 404: Not Found
(the stack trace below is from the last attempt, which passes a User-Agent).
----> 9 f = urllib.request.urlopen(req)
10 print(f.read().decode('utf-8'))
#other lines
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 404: Not Found
I tried the solution of adding a User-Agent from here: Web Scraping using Python giving HTTP Error 404: Not Found, even though I would have expected a 403 rather than a 404 error code - it did seem to work for a number of OPs there.
This still fails with the same error. I'm fairly sure I could get around this by simply using selenium and passing the csv URLs to .get, but I'd like to know whether I can solve it with requests alone.
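For illustration, this is roughly the requests-only approach I have in mind (just a sketch: csv_url is a placeholder for one of the scraped links and the User-Agent value is arbitrary):

import requests

csv_url = 'https://example.com/some-file.csv'  # placeholder for one of the scraped download links
headers = {'User-Agent': 'Mozilla/5.0'}        # arbitrary browser-like value

r = requests.get(csv_url, headers=headers)
r.raise_for_status()  # surface a 404/403 instead of silently saving an error page

with open('test.csv', 'wb') as f:
    f.write(r.content)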
Overview:
I visit the Patients Registered at a GP Practice publication page (the start URL in the code below).
I grab all of the monthly release links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all of the csv links within.
I loop over the final dictionary of filename: download_url pairs and attempt to download the files.
Question:
Can anyone see what I'm doing wrong, or how to fix this, so I can download the files without resorting to selenium? I'm also unsure of the most efficient way to accomplish this - perhaps urllib isn't needed at all and requests alone would suffice?
Python:
Without a User-Agent:
import requests
from bs4 import BeautifulSoup as bs
import urllib.request

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  # ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  # flatten list of dicts to single dict
path = r'C:\Users\User\Desktop'

for k, v in all_files.items():
    #print(k, v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()
    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break  # as only need one test case
Testing with a User-Agent added:
req = urllib.request.Request(
    v,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
Answer 0 (score: 1):
Checking the values shows me that your link is
https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv
The csv hrefs on those pages are already absolute URLs, so I think you want to remove base +. Use this:
file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
instead of:
file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
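(Not part of the original answer, but as a defensive sketch: if a page ever mixed relative and absolute hrefs, urllib.parse.urljoin would handle both cases, since it leaves absolute URLs untouched.)

from urllib.parse import urljoin

file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href'])
              for item in soup.select('[href$=".csv"]')}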
EDIT: full code:
import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]: item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  # ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  # flatten list of dicts to single dict
path = 'C:/Users/User/Desktop/'

for k, v in all_files.items():
    #print(k, v)
    print(v)
    response = requests.get(v)
    html = response.content
    k = k.replace(':', ' -')  # ':' is not allowed in Windows file names
    file = path + k + '.csv'
    with open(file, 'wb') as f:
        f.write(html)
    break  # as only need one test case
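As a side note (not from the original answer): for larger files it may be worth streaming the download rather than holding the whole response in memory, e.g.:

response = requests.get(v, stream=True)
response.raise_for_status()
with open(file, 'wb') as f:
    # write the body in chunks instead of loading it all at once
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)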