Unable to download files from a website

Date: 2017-12-13 22:23:25

Tags: python python-3.x web-scraping download

I've written some code in Python to download files from a web page. Since I don't know how to download a file from a website, all I've managed so far is to scrape the file links from that site. I'd be very grateful if someone could help me accomplish this. Many thanks.

Link to the site: web_link

Here is my attempt:

from bs4 import BeautifulSoup
import requests

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    print(item['href'])

Upon execution, the script above prints four different URLs for those files.

2 Answers:

Answer 0 (score: 2)

You can use requests.get:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    filename = item['href'].split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)
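
Note that requests.get(item['href']).content buffers the entire file in memory before writing it out, which is fine for small documents but wasteful for large ones (the streaming approach in the next answer avoids this). It is also worth failing loudly on HTTP errors rather than silently saving an error page to disk; a minimal sketch of the same loop with that check added:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    filename = item['href'].split('/')[-1]
    r = requests.get(item['href'])
    r.raise_for_status()  # raise an HTTPError on 4xx/5xx instead of writing an error page
    with open(filename, 'wb') as f:
        f.write(r.content)  # the whole response body is already in memory at this point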

Answer 1 (score: 1)

You could use the standard library's urllib.request.urlretrieve() (a minimal sketch follows at the end of this answer), but since you're already using requests, you can reuse the session here (download_file below is adapted from a widely used requests streaming-download recipe):

from bs4 import BeautifulSoup
import requests


def download_file(session, url):
    local_filename = url.split('/')[-1]

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)

    return local_filename


with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")