I want to download all of the .xls, .xlsx, and .csv files from this website into a specified folder:
https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009
I have already looked into mechanize, Beautiful Soup, urllib2, and so on. Mechanize doesn't work in Python 3, and urllib2 also had problems with Python 3; I looked for workarounds but couldn't get them going. So I am currently trying to make this work with Beautiful Soup.
I found some example code and tried to modify it to fit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code does not extract any files from the target page, nor does it print any failure message (e.g. 'failed to download').
Answer 0 (score: 4)
The problems with your script are:

- The url has a trailing /, which returns an invalid page when requested, one that does not list the files you want to download.
- The CSS selector in soup.select(...) is looking for div elements with an attribute webpartid, which does not exist anywhere in the linked document.
- The try:...except: block is hiding the errors generated when trying to download the files. Using an except block without a specific exception is bad practice and should be avoided.

A modified version of your code that gets the correct files and attempts to download them follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run this, you'll notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file can be downloaded in the browser.
At first I thought this was a referer check (to prevent hotlinking), but if you watch the request in your browser (e.g. Chrome Developer Tools), you'll notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.
In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this, you need to rewrite http: to https: before using the URL for the request. The following code will correctly modify the URLs and download the files. I've also added a variable to specify the output folder, which is joined onto the filenames using os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
Answer 1 (score: 1)
I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:
import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer
from urlparse import urljoin  # Python 2.7; on Python 3 use urllib.parse

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']

# Fetch the page once, then scan every anchor for the wanted extensions
response = requests.get(url)
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and any(file_type in link['href'] for file_type in file_types):
        # Resolve the link against the page URL in case it is relative
        full_path = urljoin(url, link['href'])
        wget.download(full_path)
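Since the original question asked for downloads into a specified folder, it may also help that wget.download() accepts an out argument that can point at a target directory. A minimal sketch, assuming a pre-existing folder named downloads (a hypothetical path):

# Save each file into a target folder instead of the current working
# directory; 'downloads' is an assumed, already-created directory.
wget.download(full_path, out='downloads')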
Answer 2 (score: 1)
I tried the code above and it still gives me urllib.error.HTTPError: HTTP Error 403: Forbidden.
I also tried adding a user agent; here is my modified code:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import Request, urlopen, urlretrieve

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)
#URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = r'E:\python\out'  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")