Question

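I wrote a script using BeautifulSoup and urllib that iterates through a list of URLs and downloads items of certain file types.

I iterate through a list of URLs, create a soup object from each one, and parse the links.

The problem I'm running into is that the links in the source are sometimes different, even though all of the links I'm working with are on the same website. For example, sometimes a link will be '/dir/pdfs/file.pdf', or 'pdf/file.pdf', or '/pdfs/file.pdf'.

So if the link is a full URL, urlretrieve() knows how to handle it, but if it is just a subdirectory like the ones listed above, it returns an error. I can of course follow the links from the source manually, but urlretrieve() doesn't know what to do with them, so I have to add a base URL (e.g. www.example.com/ or www.example.com/dir/) to the urlretrieve() call.

I'm having trouble writing logic so that, if the download fails, it tries adding different base URLs until one works and prints the URL, and, if none of them work, prints an error message naming the problem file so I can grab it manually.

Can someone point me in the right direction?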

URLs = []
BASEURL = []
FILETYPE = ['\.pdf$','\.ppt$', '\.pptx$', '\.doc$', 
            '\.docx$', '\.xls$', '\.xlsx$', '\.wmv$']

def main():
    for link in soup.findAll(href = compile(types)):
        file = link.get('href')
        filename = file.split('/')[-1]

        urlretrieve(filename)
        print file

if __name__ == "__main__":
    for url in URLs:
        html_data = urlopen(url)
        soup = BeautifulSoup(html_data)

        for types in FILETYPE:
            main()

Answer 1

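Assuming the download method downloads the file and returns True on success and False on failure, this will loop through every possible file path built from the given URLs and files.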

def download(url, file):
    print url + file
    # Assume the download failed and return False, so that for this demo
    # the loop runs through every base URL for every file.
    return False

def main():
    urls = ["example.com/", "example.com/docs/", "example.com/dir/docs/", "example.com/dir/doocs/files/"]

    files = ["file1.pdf", "file2.pdf", "file3.pdf"]

    for file in files:
        for url in urls:
            success = download(url, file)
            if success:
                break


main()
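
The question also asked for an error message when none of the base URLs work. Below is a minimal Python 3 sketch of that fallback. The names `fetch_with_fallback` and `fetch` are hypothetical and introduced here for illustration; `fetch` is assumed to raise OSError on failure, which is what urllib.request.urlretrieve does for network and HTTP errors. Base URLs are assumed to include a scheme such as "http://".

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve


def fetch_with_fallback(href, base_urls, fetch=urlretrieve):
    """Try href against each base URL in turn.

    Returns the first URL that downloads successfully, or None if
    every base URL fails. `fetch` is a hypothetical hook so the
    download step can be swapped out (e.g. in tests).
    """
    for base in base_urls:
        candidate = urljoin(base, href)
        try:
            fetch(candidate)
        except OSError:
            # This base did not work; move on to the next one.
            continue
        print("Downloaded", candidate)
        return candidate

    # Reached only when every base URL failed.
    print("Could not download %s, fetch it manually" % href)
    return None
```

Calling `fetch_with_fallback('pdf/file.pdf', ['http://www.example.com/', 'http://www.example.com/dir/'])` would try each combination in order and stop at the first one that succeeds.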

Answer 2

A better option is to build the correct absolute URL to begin with:

def main(soup, domain, path, types):
    for link in soup.findAll(href = compile(types)):
        file = link.get('href')

        # Make file URL absolute here
        if '://' not in file and not file.startswith('//'):
            if not file.startswith('/'):
                # The trailing slash makes urljoin treat path as a
                # directory instead of replacing its last segment.
                file = urlparse.urljoin(path + '/', file)
            file = urlparse.urljoin(domain, file)

        try:
            urlretrieve(file)
        except IOError:
            print 'Error retrieving %s using URL %s' % (
                link.get('href'), file)

for url in URLs:
    html_data = urlopen(url)
    soup = BeautifulSoup(html_data)

    urlinfo = urlparse.urlparse(url)
    domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
    path = urlinfo.path.rsplit('/', 1)[0]

    for types in FILETYPE:
        main(soup, domain, path, types)

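The urlparse functions are used to split the source URL into two segments: domain contains the URI scheme and domain name, and path contains the "directory" of the target file on the server. For example: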

>>> url = "http://www.example.com/some/web/page.html"
>>> urlinfo = urlparse.urlparse(url)
>>> urlinfo
ParseResult(scheme='http', netloc='www.example.com',
            path='/some/web/page.html', params='', query='', fragment='')
>>> domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
>>> domain
'http://www.example.com'
>>> path = urlinfo.path.rsplit('/', 1)[0]
>>> path
'/some/web'

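domain and path are then used as the base for each href encountered:

If the href contains "://" or starts with "//", it is assumed to be absolute: no modification is needed.
If the href starts with "/", it is relative to the domain: prepend the domain.
Otherwise the href is relative to the path: prepend the domain and the base path.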
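
These three cases are the same rules urljoin applies when it is given the full page URL as the base, so the resolution can also be sketched as a single call. This is a Python 3 sketch (urljoin lives in urllib.parse there, rather than in the urlparse module); the page URL and hrefs are illustrative values modeled on the question, not real resources.

```python
from urllib.parse import urljoin

# Illustrative page URL, matching the example above.
page_url = "http://www.example.com/some/web/page.html"

hrefs = [
    "http://other.example.org/a.pdf",  # absolute: left unchanged
    "//cdn.example.com/a.pdf",         # scheme-relative: inherits "http:"
    "/pdfs/file.pdf",                  # relative to the domain root
    "pdf/file.pdf",                    # relative to the page's directory
]

# Resolve every href against the page it was found on.
resolved = [urljoin(page_url, href) for href in hrefs]

for href, absolute in zip(hrefs, resolved):
    print(href, "->", absolute)
```

Using the page URL itself as the base avoids computing domain and path separately, at the cost of hiding which of the three cases applied.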

Answer 3

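You need to catch the exception and try the next base URL. That said, you could also try to make the links absolute before issuing the request. I think that is the best approach, since it avoids a lot of unnecessary requests. lxml has a handy make_links_absolute() function for this purpose.

Also, take a look at urlparse.urljoin. Continuing with the approach you are already using...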

html_data = urlopen(url)
soup = BeautifulSoup(html_data)
for link in soup.findAll(href = compile(types)):
    file = link.get('href')
    for domain in (url, 'http://www.one.com', 'http://www.two.com'):
        path = urlparse.urljoin(domain, file)
        try:
            # retrieve the joined URL, not the page URL
            req = urllib.urlretrieve(path)
            break  # stop trying new domains
        except IOError:
            print 'Error downloading {0}'.format(path)
            # will go to the next domain

If I were doing this with lxml, it would look something like:

import lxml.html

req = urlopen(url)
html = req.read()
root = lxml.html.fromstring(html)
root.make_links_absolute(url)  # resolve every link against the page URL
for a in root.iterlinks():
    # a is an (element, attribute, link, pos) tuple
    if a[2].endswith('pdf'):
        # download link ending with pdf
        req = urlretrieve(a[2])
Question: Iterate through a list of possible URLs to download files

3 answers: