Question

I have set up a web scraper with BeautifulSoup, Selenium (Chrome) and Python.

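It's simple: BeautifulSoup parses a page full of links, and Selenium follows those links one at a time. On each page, Selenium finds a specific download link and clicks it, which starts the download. I would use BeautifulSoup alone, but if the links are not clicked from a browser, the site sends me to a page with a CAPTCHA.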

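Anyway, when Selenium starts downloading a file, the progress is shown at the bottom of the Chrome window. However, after a seemingly random amount of time, Chrome reports the download as finished when the file is only partly downloaded. An 8 MB file might download only 500 KB and then be reported as complete. I don't know why it doesn't download the whole file. Does anyone know a way to wait until the file has actually finished downloading? Is there a limit on how many files can be downloaded at once? I am downloading quite a lot of files.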

I don't know what to do, and I hope someone can shed some light on this.

Answer 1

You don't really need Selenium here.

requests and BeautifulSoup are enough. Just set the User-Agent, Host and Referer headers correctly:

from bs4 import BeautifulSoup
import requests

URL = 'url here'

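def download_file(url, filename, headers):
    # stream the response so the whole file is never held in memory
    r = requests.get(url, stream=True, headers=headers)
    r.raise_for_status()  # fail early instead of saving an error page
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)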

# get link and file name
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly
a = soup.find('td', text='Download:').next_sibling.a
link = a.get('href')
filename = a.text + '.pdf'

# download file
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36',
    'Host': 'filepi.com',  # host could be extracted from the link
    'Referer': URL
}
download_file(link, filename, headers)
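
Since the question is about downloads that stop partway, it may help to check the size of the saved file against what the server announced. The following is only a minimal sketch: `host_from_url` and `is_complete` are hypothetical helper names that are not part of the answer above, and the check assumes the server sends a Content-Length header and does not compress the response body (requests decodes gzip or deflate bodies in `iter_content`, so the size on disk could then differ from Content-Length).

```python
import os
from urllib.parse import urlparse


def host_from_url(url):
    """Return the host part of a URL, e.g. for building the Host header."""
    return urlparse(url).netloc


def is_complete(path, content_length):
    """Compare the size of the file on disk with a Content-Length value.

    content_length is the raw header value (a string) or None.
    Returns True or False, or None when no Content-Length was sent,
    in which case completeness cannot be judged this way.
    """
    if content_length is None:
        return None
    return os.path.getsize(path) == int(content_length)
```

One way to use it would be to have `download_file` return `r.headers.get('Content-Length')` and then call `is_complete(filename, that_value)` after the download, retrying the request when the result is `False`.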

Selenium does not fully download the file
