Using Selenium and Beautiful Soup together

Time: 2020-08-09 03:05:52

Tags: python selenium web-scraping beautifulsoup

I'm scraping a Google Scholar profile page. I have Python code using Selenium that automates the profile page by clicking the "Show more" button, and Python code using the Beautiful Soup library that collects data from the page. The Beautiful Soup code is:

import requests
from bs4 import BeautifulSoup

url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

# A single request returns only the rows present in the initial HTML,
# before any "Show more" click.
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Each publication is a table row with class "gsc_a_tr".
research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})

for research in research_article:
    title = research.find('a', {'class': 'gsc_a_at'}).text
    authors = research.find('div', {'class': 'gs_gray'}).text

    print('Title:', title, '\n', '\nAuthors:', authors)

How can I merge these two blocks of code so that I can click the "Show more" button and scrape the entire page? Thanks in advance!

1 Answer:

Answer 0: (score: 0)

This script prints all the titles and authors on the page:

import re
import requests
from bs4 import BeautifulSoup


url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'

# Pull the profile ID out of the URL (this pattern expects another parameter after `user`).
user_id = re.search(r'user=(.*?)&', url).group(1)

page_size = 100  # rows requested per page
start = 0        # index of the first row on the current page
while True:
    page = requests.post(api_url.format(user=user_id, start=start, pagesize=page_size))
    soup = BeautifulSoup(page.content, 'html.parser')

    research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})

    for i, research in enumerate(research_article, 1):
        title = research.find('a', {'class': 'gsc_a_at'})
        authors = research.find('div', {'class': 'gs_gray'})

        print('{:04d} {:<80} {}'.format(start + i, title.text, authors.text))

    # A page with fewer rows than requested is the last one.
    if len(research_article) != page_size:
        break

    start += page_size
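
The loop in the answer depends on two query parameters, `cstart` and `pagesize`, and stops at the first page that comes back short. As a rough sketch of just that paging logic, separated from the network calls, the helpers below use only the standard library. The names `extract_user_id`, `page_url`, and `iter_rows` are hypothetical, and `fetch_page` stands in for whatever function actually downloads and parses one page.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://scholar.google.com/citations"


def extract_user_id(profile_url):
    """Read the `user` query parameter, wherever it sits in the URL."""
    query = parse_qs(urlparse(profile_url).query)
    return query["user"][0]


def page_url(user_id, start, page_size=100):
    """Build the URL for the page of rows beginning at index `start`."""
    params = {"user": user_id, "hl": "en", "cstart": start, "pagesize": page_size}
    return BASE_URL + "?" + urlencode(params)


def iter_rows(fetch_page, page_size=100):
    """Yield rows page by page until a page has fewer than `page_size` rows.

    `fetch_page(start, page_size)` is a hypothetical callable that returns
    the list of rows for one page.
    """
    start = 0
    while True:
        rows = fetch_page(start, page_size)
        yield from rows
        if len(rows) < page_size:
            break
        start += page_size
```

Unlike the regular expression `user=(.*?)&`, parsing the query string also handles a URL in which `user` is the last parameter.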

Running the answer's script prints:

0001 Hyper-heuristics: A Survey of the State of the Art                               EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology              E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering                       EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit                           P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches                                   EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms                                                               K Sastry, D Goldberg, G Kendall

...

0431 Solution Methodologies for generating robust Airline Schedules                   F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems                                            T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
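
The question asked about combining Selenium with Beautiful Soup directly, so here is a minimal sketch of that route, offered as one possible approach rather than a tested solution. It assumes Selenium 4 with a Chrome driver available, and it assumes the "Show more" button has the element ID `gsc_bpf_more` and becomes disabled once every row is loaded. Both assumptions should be checked against the live page. The function names, `pause`, and `max_clicks` are hypothetical.

```python
from bs4 import BeautifulSoup


def parse_rows(html):
    """Return (title, authors) pairs found in a profile page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", {"class": "gsc_a_tr"}):
        title = row.find("a", {"class": "gsc_a_at"})
        authors = row.find("div", {"class": "gs_gray"})
        if title is None or authors is None:
            continue  # skip rows that lack the expected markup
        results.append((title.get_text(strip=True), authors.get_text(strip=True)))
    return results


def load_full_profile(url, button_id="gsc_bpf_more", pause=2.0, max_clicks=50):
    """Click "Show more" until it is disabled, then return the page HTML.

    `button_id` is an assumption about the page's markup, not a documented API.
    Selenium is imported here so that `parse_rows` works without a browser.
    """
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        for _ in range(max_clicks):  # upper bound so the loop always ends
            button = driver.find_element(By.ID, button_id)
            if not button.is_enabled():
                break
            button.click()
            time.sleep(pause)  # crude fixed wait for the new rows to load
        return driver.page_source
    finally:
        driver.quit()
```

With these two pieces, `parse_rows(load_full_profile(url))` would hand the fully expanded page from the browser to Beautiful Soup. The fixed `time.sleep` keeps the sketch short; an explicit wait on the row count would be more robust.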