Is there another way to download PDF files from the internet without corrupting them?

Time: 2020-02-23 05:47:07

Tags: python-3.x web-scraping

I wrote the following code for a web scraper (in Python):

import requests, bs4  # you probably need to install both: pip install requests beautifulsoup4

# Collect every link on the first 100 result pages whose href contains 'pdf'
link_list = []
for x in range(0, 100):
    i = str(x * 10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and 'pdf' in href:  # some <a> tags have no href, so guard against None
            link_list.append(href)

# Parameter 1 of open() is a file path that is available only on my computer.
# Set it to something that is accessible on your computer.
# Your final path should be Something\something\something\{x}.pdf
# Do not change the last part!
if link_list:
    for x, link in enumerate(link_list):
        res_3 = requests.get(link)
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:
            f.write(res_3.content)
        print(x)
else:
    print('sorry, unavailable')

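For context, I am trying to bulk-download PDFs from Google Scholar instead of doing it manually.

I managed to download the vast majority of the PDFs, but when I try to open them, some of them give me this message:

"The file may be damaged or use a file format that Preview doesn't recognize."

As the code above shows, I am using requests to download the content and write it to a file. Is there a way to fix this?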
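
One possible cause, which the question does not confirm, is that some of the collected links return something other than a PDF (for example an HTML landing page or an error page), and those bytes are then saved under a .pdf name. A minimal sketch of a check that could run before writing each file is shown below. The helper names `looks_like_pdf` and `save_if_pdf` are hypothetical, and `res` is assumed to be the `requests.Response` returned by `requests.get`.

```python
PDF_SIGNATURE = b'%PDF-'  # PDF files begin with this byte sequence


def looks_like_pdf(content):
    """Return True if the downloaded bytes start with the PDF signature."""
    return content.startswith(PDF_SIGNATURE)


def save_if_pdf(res, path):
    """Write res.content to path only if the response looks like a PDF.

    res is assumed to be a requests.Response (hypothetical helper, not part
    of the original code). Returns True if the file was written, False if
    the response was skipped.
    """
    # A non-200 status usually means the body is an error page, not the file.
    if res.status_code != 200:
        print(f'skipped {res.url}: HTTP {res.status_code}')
        return False
    # Check the body itself, because a link containing 'pdf' can still serve HTML.
    if not looks_like_pdf(res.content):
        content_type = res.headers.get('Content-Type', 'unknown')
        print(f'skipped {res.url}: body is not a PDF (Content-Type: {content_type})')
        return False
    with open(path, 'wb') as f:
        f.write(res.content)
    return True
```

In the download loop, the `with open(...)` block could then become `save_if_pdf(res_3, f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf')`. The skipped URLs that get printed should show whether the unreadable files were really PDFs in the first place.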

0 answers:

No answers