Question

I have code that uses urllib to download PDF files from HTML links stored in a pandas DataFrame.

The DataFrame looks like this:

id   URL
1    https://www.pdf.com/first.pdf
2    https://www.pdf.com/second.pdf
3    https://www.pdf.com/third.pdf
:
:
N    https://www.pdf.com/numberN.pdf

My code:

import urllib.request
import pandas as pd

df = pd.read_csv('pdf list.csv')

#convert the URL column into a list
l = df['URL'].to_list()

#loop through the list to download pdf file from the HTML link
for link in l:
    urllib.request.urlretrieve(link, "/Users/CodingStark/pdf/name of the pdf")

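My DataFrame contains thousands of HTML links, so naming each downloaded PDF by hand is not practical. Is there a way to automate this code so that I don't have to name each PDF file myself? When I open the links, the PDF already has a name assigned to it. Can I use that name instead? Example link

Thank you very much!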

Answer 1

import urllib.request
import pandas as pd

df = pd.read_csv('pdf list.csv')

# convert the URL column into a list
l = df['URL'].to_list()

# loop through the list and download each PDF from its link
for link in l:
    # the last path segment of the URL is the file name, e.g. "first.pdf"
    filename = link.split('/')[-1]
    urllib.request.urlretrieve(link, f"/Users/CodingStark/pdf/{filename}")

The idea is to split the URL string on the '/' delimiter and use the last segment as the file name.
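
If some links carry a query string or percent-encoded characters, a plain split on '/' would keep those in the file name. The following is a minimal sketch of one way to handle that with the standard library. The helper names filename_from_url and download_all, the fallback name "download.pdf", and the dest_dir parameter are assumptions made for illustration and are not part of the original answer.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="download.pdf"):
    """Return the last path segment of a URL (hypothetical helper).

    The query string and fragment are ignored, and percent-escapes
    such as %20 are decoded. If the path has no final segment,
    the assumed fallback name `default` is returned.
    """
    path = urlparse(url).path                 # drops '?query' and '#fragment'
    name = posixpath.basename(unquote(path))  # text after the last '/'
    return name or default


def download_all(links, dest_dir):
    """Download every link into dest_dir (hypothetical helper, not run here)."""
    os.makedirs(dest_dir, exist_ok=True)
    for link in links:
        target = os.path.join(dest_dir, filename_from_url(link))
        urllib.request.urlretrieve(link, target)


print(filename_from_url("https://www.pdf.com/first.pdf"))
print(filename_from_url("https://www.pdf.com/my%20file.pdf?download=1"))
```

With the DataFrame from the question, this could be called as download_all(df['URL'].to_list(), "/Users/CodingStark/pdf"). Note that two links ending in the same file name would overwrite each other, so a real run may need a check for duplicates.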