I have written some code in Python to download files from a webpage. Since I don't know how to download files from a site, all I have managed so far is scraping the file links from it. I would be very grateful if someone could help me accomplish this. Thanks a lot.
Link to the site: web_link
Here is my attempt:
from bs4 import BeautifulSoup
import requests
response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text,"lxml")
for item in soup.select("#latest a"):
    print(item['href'])
When executed, the script above prints four different URLs for those files.
Answer 0 (score: 2)
You can use requests.get():
import requests
from bs4 import BeautifulSoup
response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
"viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    filename = item['href'].split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)
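One caveat, not part of the original answer: if the scraped href values are relative paths rather than absolute URLs, requests.get(item['href']) will fail. The standard library's urllib.parse.urljoin resolves a href against the page URL either way; the relative path below is a hypothetical example, not one scraped from the actual page:

```python
from urllib.parse import urljoin

base = "http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194"
# hypothetical relative href as it might appear in the page
href = "/usda/current/Fruit/Fruit-06-2017.pdf"
# urljoin resolves the href against the page URL;
# an already-absolute href passes through unchanged
print(urljoin(base, href))
```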
Answer 1 (score: 1)
You could use the standard library's urllib.request.urlretrieve(), but since you are already using requests, you can reuse the session here (download_file is adapted mostly from {{3}}):
from bs4 import BeautifulSoup
import requests
def download_file(session, url):
    local_filename = url.split('/')[-1]
    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename

with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/"
                           "viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")
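For completeness, the urllib.request.urlretrieve() alternative mentioned at the start of this answer can be sketched as below. Note that it does not reuse the requests session, and the function name here is just an illustration, not part of the original answer:

```python
import urllib.request

def fetch(url, filename):
    # urlretrieve copies the resource at `url` to the local path
    # `filename` and returns that path plus the response headers
    path, headers = urllib.request.urlretrieve(url, filename)
    return path
```

In the scraping loop above, this would be called as fetch(item['href'], item['href'].split('/')[-1]) for each link.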