I have written the following code for a web scraper (in Python):
import requests, bs4  # you probably need to install requests and bs4; just go online and search for "beautiful soup 4 installation" and "requests installation"
link_list = []
res = requests.get('https://scholar.google.com/scholar?start=0&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and 'pdf' in href:  # skip <a> tags that have no href
        link_list.append(href)
for x in range(1, 100):
    i = str(x*10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res_2 = requests.get(url)
    soup = bs4.BeautifulSoup(res_2.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and 'pdf' in href:  # skip <a> tags that have no href
            link_list.append(href)
if link_list:
    for x in range(0, len(link_list)):
        res_3 = requests.get(link_list[x])
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:  # parameter 1 of the open function is set to a file path that is available only on my computer
            f.write(res_3.content)
        print(x)  # Set it to something that is accessible on your computer.
else:  # Your final path should be -
    print('sorry, unavailable')  # Something\something\something\{x}.pdf
# Do not change the last part !
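For context, I am trying to bulk-download PDFs from Google Scholar instead of doing it manually.
I managed to download the vast majority of the PDFs, but when I try to open them, some of them give me this message:
"The file may be damaged or use a file format that Preview doesn't recognize."
As the code above shows, I am using requests to download the content and write it to a file. Is there a way to fix this?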
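One thing that may be worth checking (an assumption, not something confirmed for these particular files) is whether every link that contains 'pdf' actually returns a PDF. Some may return an HTML page instead, which would then be saved under a .pdf name and fail to open. A minimal sketch of such a check follows; the helper names `looks_like_pdf` and `save_if_pdf` are hypothetical and not part of requests or bs4:

```python
from pathlib import Path


def looks_like_pdf(content: bytes) -> bool:
    """Heuristic check: a PDF normally has the marker %PDF- at (or very near) its start."""
    return b"%PDF-" in content[:1024]


def save_if_pdf(content: bytes, path) -> bool:
    """Write content to path only if it looks like a PDF; return True if it was written."""
    if not looks_like_pdf(content):
        return False
    Path(path).write_bytes(content)
    return True


# Hypothetical use inside the download loop (res_3 is the requests response):
#     if not save_if_pdf(res_3.content, f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf'):
#         print(x, 'is not a PDF, got:', res_3.headers.get('Content-Type'))
```

Printing the Content-Type header for the skipped links should show whether the server sent something like text/html rather than application/pdf.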