Using Python and BeautifulSoup

Posted: 2016-01-06 12:10:43

Tags: python web-scraping beautifulsoup

I want to download all of the .xls / .xlsx / .csv files from this website into a specified folder:

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, etc. mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for workarounds but could not get them working. So I am currently trying to get this working with Beautiful Soup.

I found some example code and attempted to modify it to fit my problem, as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')

However, when run, this code does not extract the files from the target page, nor does it output any failure message (e.g. 'failed to download').

  • How do I select the Excel files on the page with BeautifulSoup?
  • How do I download those files to a local folder using Python?

4 Answers:

Answer 0 (score: 4)

The problems with your script are:

  1. The url has a trailing /, which when requested returns an invalid page that does not list the files you want to download.
  2. The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in the linked document.
  3. You are joining the URL and quoting it, even though the links are given in the page as absolute URLs and do not need quoting.
  4. The try:...except: block is hiding the errors generated when the downloads are attempted; using an except block without a specific exception is bad practice and should be avoided (see the example at the end of this answer).

A modified version of your code that gets the correct files and attempts to download them is:

    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve, quote
    from urllib.parse import urljoin
    
    # Remove the trailing / you had, as that gives a 404 page
    url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    u = urlopen(url)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()
    
    soup = BeautifulSoup(html, "html.parser")
    
    # Select all A elements with href attributes containing URLs starting with http://
    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
    
        # Make sure it has one of the correct extensions
        if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
            continue
    
        filename = href.rsplit('/', 1)[-1]
        print("Downloading %s to %s..." % (href, filename) )
        urlretrieve(href, filename)
        print("Done.")
    

However, if you run this you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file can be downloaded in a browser. At first I thought this was a referrer check (to prevent hotlinking), but if you watch the request in your browser (e.g. Chrome Developer Tools) you will notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.

In other words, the request must go via HTTPS to work (despite what the URLs in the page say). To fix this you need to rewrite http: to https: before using the URL in the request. The following code modifies the URLs correctly and downloads the files. I have also added a variable to specify the output folder, which is joined to the filename using os.path.join:

    import os
    from bs4 import BeautifulSoup
    # Python 3.x
    from urllib.request import urlopen, urlretrieve
    
    URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
    OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder
    
    u = urlopen(URL)
    try:
        html = u.read().decode('utf-8')
    finally:
        u.close()
    
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.select('a[href^="http://"]'):
        href = link.get('href')
        if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
            continue
    
        filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    
        # We need a https:// URL for this site
        href = href.replace('http://','https://')
    
        print("Downloading %s to %s..." % (href, filename) )
        urlretrieve(href, filename)
        print("Done.")
    
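Finally, regarding point 4: if you do want the loop to keep going after individual failures, catch the specific exception rather than using a bare except. For example:

    from urllib.error import HTTPError

    try:
        urlretrieve(href, filename)
    except HTTPError as e:
        # Report the failing URL and HTTP status instead of silently swallowing them
        print("Failed to download %s: %s" % (href, e))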

Answer 1 (score: 1)

I found this to be a good working example, using the BeautifulSoup4, requests, and wget modules for Python 2.7:

import requests
import wget

from bs4 import BeautifulSoup, SoupStrainer

try:
    from urlparse import urljoin   # Python 2.7
except ImportError:
    from urllib.parse import urljoin   # Python 3

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'

file_types = ['.xls', '.xlsx', '.csv']

# Fetch and parse the page once, keeping only the <a> tags
response = requests.get(url)
links = BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a'))

for link in links:
    if link.has_attr('href') and any(ft in link['href'] for ft in file_types):
        # urljoin copes with both absolute and relative hrefs,
        # unlike naive string concatenation of url + href
        full_path = urljoin(url, link['href'])
        wget.download(full_path)
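
To save the files into a specific folder rather than the current working directory, wget.download also accepts an out argument (this assumes the PyPI wget package; the 'downloads' folder name is only an example):

wget.download(full_path, out='downloads')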

Answer 2 (score: 1)

I tried the above code and it still gives me urllib.error.HTTPError: HTTP Error 403: Forbidden.

I also tried adding a user agent; my modified code:


import os

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import Request,urlopen, urlretrieve

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
URL = Request('https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009', headers=headers)

#URL = 'https://www.rbi.org.in/scripts/bs_viewcontent.aspx?Id=2009'

OUTPUT_DIR = r'E:\python\out'  # raw string avoids backslash escapes; '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv','.xls','.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://','https://')

    print("Downloading %s to %s..." % (href, filename) )
    urlretrieve(href, filename)
    print("Done.")

Answer 3 (score: 0)

This worked best for me... using python3

"models": {
   "":{
    ...
    ...
   }
   "DataModel": {
     "type": "sap.ui.model.json.JSONModel",
     "settings": {},
     "uri": "/comments",
     "preload": false
   }
}