Question

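I've written some code to scrape the attachments posted on a website. In practice it only scrapes the hyperlinks to the attachments, and I can't find a way to save those attachments directly to local disk.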

import pandas as pd
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.amfiindia.com/research-information/amfi-monthly'
response = get(url, verify=False)
html_soup = BeautifulSoup(response.content, 'html.parser')

filetype = '.xls'
excel_sheets = html_soup.find_all('a')

#File name where the links to the excel sheet needs to be saved --> here: "All_Links_2.csv"
destination = open('All_Links_2.csv','wb')

for link in excel_sheets:
    href = link.get('href')  # None for <a> tags that have no href attribute
    if href and filetype in href:
        print(href)

Can anyone help with this?

Answer 1

Beautiful Soup isn't really the tool for the download step; use the urllib library for that instead.

import urllib.request

urllib.request.urlretrieve(href, "file.jpg")

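This fetches the file at the given address and saves it as file.jpg. If you want a different file name for each download (which applies in your case), build the string "file" + str(i) + ".jpg", where i is a counter that you increment.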
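To make the idea above concrete for the .xls links in the question, here is a minimal sketch, not a definitive implementation. The helper names local_name, plan_downloads and download_all and the attachments folder are hypothetical; the sketch also assumes that some hrefs may be relative and therefore resolves them against the page URL with urljoin.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urljoin, urlsplit


def local_name(file_url, index):
    """Derive a local file name from the URL path, falling back to a counter."""
    name = posixpath.basename(unquote(urlsplit(file_url).path))
    return name or f"file{index}"


def plan_downloads(base_url, hrefs, filetype=".xls"):
    """Return (absolute_url, local_name) pairs for hrefs containing filetype."""
    plan = []
    matching = (h for h in hrefs if h and filetype in h)  # skips None hrefs
    for index, href in enumerate(matching):
        file_url = urljoin(base_url, href)  # resolves relative links
        plan.append((file_url, local_name(file_url, index)))
    return plan


def download_all(base_url, hrefs, folder="attachments", filetype=".xls"):
    """Fetch every planned file into folder with urllib.request.urlretrieve."""
    os.makedirs(folder, exist_ok=True)
    for file_url, name in plan_downloads(base_url, hrefs, filetype):
        urllib.request.urlretrieve(file_url, os.path.join(folder, name))
```

With the question's variables, this could be called as download_all(url, [a.get('href') for a in excel_sheets]). Note that urlretrieve verifies TLS certificates by default, unlike the verify=False request in the question, so it may fail if the site's certificate does not validate.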

Answer 2

If you only want to collect the links, you don't need binary mode, and since you've already imported pandas you can use it to save them.

First create a DataFrame:

df = pd.DataFrame([a['href'] for a in excel_sheets if filetype in a.get('href', '')])

Then save it without the column names (header=False):

df.to_csv('All_Links_2.csv', header=False)
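If pandas isn't needed for anything else, roughly the same file can be written with the standard library's csv module, opening the file in text mode, not binary mode. This is a sketch of an alternative, not the method from the answer above; the function name save_links is hypothetical.

```python
import csv


def save_links(hrefs, path="All_Links_2.csv", filetype=".xls"):
    """Write matching links to path, one per row, and return them."""
    links = [h for h in hrefs if h and filetype in h]  # skips None hrefs
    # Text mode with newline="" is what the csv module expects in Python 3.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([link] for link in links)
    return links
```

Unlike df.to_csv with its default index=True, which also writes the row index as a first column, this writes only the link on each row.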

How to save attachments from a website using Beautiful Soup?
