I am scraping a Google Scholar profile page. So far I have Python code that uses the Beautiful Soup library to collect data from the page.
I also have Python code that uses the Selenium library to automate the profile page and click the "Show more" button. The Beautiful Soup code is:
import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

# Fetch the page once; this HTML only contains the rows loaded initially.
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each publication is a table row with class "gsc_a_tr".
research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})
for research in research_article:
    title = research.find('a', {'class': 'gsc_a_at'}).text
    authors = research.find('div', {'class': 'gs_gray'}).text
    print('Title:', title, '\n', '\nAuthors:', authors)
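How can I merge these two code blocks so that I can click the "Show more" button and scrape the whole page? Thanks in advance!
Answer 0 (score: 0)
This script prints all the titles and authors on the page: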
import re
import requests
from bs4 import BeautifulSoup

url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'

# Extract the user ID from the profile URL.
user_id = re.search(r'user=(.*?)&', url).group(1)
start = 0

while True:
    # Request one page of up to 100 rows, starting at offset `start`.
    soup = BeautifulSoup(requests.post(api_url.format(user=user_id, start=start, pagesize=100)).content, 'html.parser')

    research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})
    for i, research in enumerate(research_article, 1):
        title = research.find('a', {'class': 'gsc_a_at'})
        authors = research.find('div', {'class': 'gs_gray'})
        print('{:04d} {:<80} {}'.format(start + i, title.text, authors.text))

    # A page with fewer than 100 rows is the last one.
    if len(research_article) != 100:
        break
    start += 100
Prints:
0001 Hyper-heuristics: A Survey of the State of the Art EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms K Sastry, D Goldberg, G Kendall
...
0431 Solution Methodologies for generating robust Airline Schedules F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
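The paging logic in the script can be separated from the network calls, which makes it easy to check offline. The following is only a sketch under stated assumptions: `extract_user_id`, `page_url`, `iter_rows`, and the `fetch_page` callable are hypothetical names introduced here, not part of the original answer. It reads the `user` parameter with `urllib.parse` instead of the regex (the pattern `user=(.*?)&` only matches when another parameter follows `user`), and it stops when a page comes back with fewer rows than the page size.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def extract_user_id(profile_url):
    """Return the `user` query parameter, wherever it appears in the URL."""
    query = parse_qs(urlparse(profile_url).query)
    return query["user"][0]


def page_url(user_id, start, pagesize=100):
    """Build the URL for one page, using the same parameters as the script."""
    params = {"user": user_id, "hl": "en", "cstart": start, "pagesize": pagesize}
    return "https://scholar.google.com/citations?" + urlencode(params)


def iter_rows(fetch_page, pagesize=100):
    """Yield rows page by page until a page comes back short.

    `fetch_page(start)` is a hypothetical callable returning the list of rows
    that begin at offset `start`. In the script it would wrap the HTTP request
    and the `find_all` call.
    """
    start = 0
    while True:
        rows = fetch_page(start)
        yield from rows
        if len(rows) < pagesize:
            break
        start += pagesize


# Offline demonstration: a list of fake rows stands in for the network.
fake_rows = list(range(250))
collected = list(iter_rows(lambda start: fake_rows[start:start + 100]))
print(len(collected))
print(extract_user_id("https://scholar.google.com/citations?hl=en&user=VjJm3zYAAAAJ"))
print(page_url("VjJm3zYAAAAJ", 100))
```

To use this with the real page, `fetch_page` could request `page_url(user_id, start)` and return `soup.find_all('tr', {'class': 'gsc_a_tr'})`, leaving the loop itself unchanged.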