Question

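I am using os.system('wget '+ link) to retrieve files from websites. After downloading, I want to process the files further depending on the link they came from.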

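Most of the links have the form http://example.com/.../filename.zip.
In that case the file is simply downloaded as filename.zip, and I can extract that name from the link using basename or the regular expression [^/]+$.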
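As a minimal sketch of that extraction (the function name and the example URL are hypothetical, chosen only for illustration), the two approaches might look like this. It also hints at why they stop being useful for bare domains:

```python
import os
import re
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last component of the URL path ('' if there is none)."""
    return os.path.basename(urlsplit(url).path)


def filename_by_regex(url):
    """Same idea with the regular expression [^/]+$ applied to the whole link."""
    match = re.search(r"[^/]+$", url)
    return match.group(0) if match else None
```

For a link that ends in a real file name both helpers agree. For a bare domain the path is empty, and the regular expression either finds nothing (trailing slash) or returns the host name rather than a file name.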

The problem is with links of the form:

http://www.ez-robot.com
http://www.worldscientific.com/
http://www.fairweld.com

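These are downloaded as index.html, index.html.1, index.html.2, and so on.
Here I can't tell which index file belongs to which website. One way I could do it is by tracking the order in which the links were passed to wget.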

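I would like a general way to get the "real" name under which a file was saved on my computer. When wget finishes, it prints a Saving to: label on the terminal, followed by that "real" filename. I want to store that filename in a string.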

Is there a direct or simpler way to get the filename? I am using Python.

$ wget http://www.fairweld.com
--2015-04-11 18:51:48--  http://www.fairweld.com/
Connecting to 202.142.81.24:3124... connected.
Proxy request sent, awaiting response... 200 OK
Length: 39979 (39K) [text/html]
Saving to: ‘index.html.4’

Answer 1

Use os.path.basename and derive the name from the end of the URL. You can also use requests to download the HTML:

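import os
from urllib.parse import urlsplit  # Python 3; on Python 2: from urlparse import urlsplit

import requests

links = ["http://www.ez-robot.com",
         "http://www.worldscientific.com/",
         "http://www.fairweld.com"]

for link in links:
    r = requests.get(link)
    # Drop any trailing slash so basename() does not return an empty string
    stripped = link.rstrip("/")
    if stripped.endswith(".com"):
        # Bare domain: use the host name, e.g. "www.fairweld.com"
        name = os.path.basename(stripped)
    else:
        # Otherwise use the last component of the URL path
        name = urlsplit(stripped).path.split("/")[-1]
    # r.content is bytes, so open the file in binary mode
    with open("{}.html".format(name), "wb") as f:
        f.write(r.content)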

Answer 2

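The problem you are running into is that the filename already exists. I suggest downloading each file into its own new folder (for example, one named after the domain) to prevent duplicates.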

$ wget --directory-prefix=$DOMAIN $URL

This keeps the original filename, as specified in the data header.
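Since the question uses Python, here is a rough sketch of how that command might be driven from a script. The helper names (domain_of, wget_command, download) are made up for illustration, and using the URL's host name as $DOMAIN is an assumption about what the folder should be called:

```python
import subprocess
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL, e.g. 'www.fairweld.com'."""
    return urlsplit(url).hostname or "unknown-host"


def wget_command(url):
    """Build the argument list for: wget --directory-prefix=$DOMAIN $URL"""
    return ["wget", "--directory-prefix=" + domain_of(url), url]


def download(url):
    """Run wget and return the directory the file was saved into."""
    # check=True raises CalledProcessError if wget exits with an error
    subprocess.run(wget_command(url), check=True)
    return domain_of(url)
```

With this layout, each index.html should end up in a folder named after its site, so the folder name tells you which link a file came from.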

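One more tip: you are using os.system('wget '+ link), which can be very unsafe because the input is not sanitized. Commands could be injected through the input, causing your system to run commands you did not intend. Read up on Bobby Tables for more.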
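One possible way to avoid the shell entirely is to pass wget an argument list through subprocess. Because wget writes its status messages to stderr, the same call can capture the Saving to: line the question asks about. This is only a sketch: the function names are invented here, and the regular expression assumes the message format shown in the question, which can vary with the wget version and locale.

```python
import re
import subprocess

# Matches the "Saving to:" line; the quote characters differ between locales.
SAVING_TO = re.compile(r"Saving to: [‘'`\"](.+?)[’'\"]\s*$", re.MULTILINE)


def parse_saved_filename(wget_output):
    """Return the filename from wget's 'Saving to:' line, or None if absent."""
    match = SAVING_TO.search(wget_output)
    return match.group(1) if match else None


def download(url):
    """Run wget without a shell and return the filename it reports."""
    # A list of arguments (no shell=True) is never interpreted by a shell,
    # so characters such as ';' or '&' in the URL cannot start other commands.
    result = subprocess.run(
        ["wget", url], stderr=subprocess.PIPE, text=True, check=True
    )
    return parse_saved_filename(result.stderr)
```

Parsing human-readable output is fragile, so treat the returned name as a best effort and check that the file actually exists before processing it.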

How to get the filename of a file downloaded by wget
