Scraping data from LinkedIn using RSelenium (and rvest)

Time: 2020-09-07 20:57:36

Tags: r selenium rvest rselenium

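I am trying to scrape some data from well-known people on LinkedIn, but I have run into a few problems. I would like to do the following:

  1. On Hadley Wickham's page (https://www.linkedin.com/in/hadleywickham/), I want to use RSelenium to log in and "click" both "Show 1 more education" and "Show 1 more experience". (Note that Hadley's page does not offer "Show 1 more experience", but it does offer "Show 1 more education".) Clicking "Show more experience/education" would let me scrape the full education and experience from the page. Alternatively, Ted Cruz's page has a "Show 5 more experiences" option, which I would also like to expand and scrape.

Code: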

library(RSelenium)
library(rvest)
library(stringr)
library(xml2)

userID = "myEmailLogin" # The linkedIn email to login
passID = "myPassword"   # and LinkedIn password

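# rsDriver() starts the Selenium server and returns an already-open client,
# so a separate remoteDriver()/open() call (which opens a second browser) is not needed
rD <- rsDriver(port = 4444L, browser = 'firefox')
remDr <- rD$client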
remDr$navigate("https://www.linkedin.com/login")

user <- remDr$findElement(using = 'id',"username")
user$sendKeysToElement(list(userID,key="tab"))

pass <- remDr$findElement(using = 'id',"password")
pass$sendKeysToElement(list(passID,key="enter"))

Sys.sleep(5) # give the page time to fully load
# Navigate to individual profiles
# remDr$navigate("https://www.linkedin.com/in/thejlo/") # Jennifer Lopez
# remDr$navigate("https://www.linkedin.com/in/cruzted/") # Ted Cruz
remDr$navigate("https://www.linkedin.com/in/hadleywickham/") # Hadley Wickham 

Sys.sleep(5) # give the page time to fully load
html <- remDr$getPageSource()[[1]]


signals <- read_html(html)

personFullNameLocationXPath <- '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/ul[1]/li[1]'
personName <- signals %>%
  html_nodes(xpath = personFullNameLocationXPath) %>% 
  html_text()

personTagLineXPath = '/html/body/div[9]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2'
personTagLine <- signals %>% 
  html_nodes(xpath = personTagLineXPath) %>% 
  html_text()

personLocationXPath <- '//*[@id="ember49"]/div[2]/div[2]/div[1]/ul[2]/li[1]'
personLocation <- signals %>% 
  html_nodes(xpath = personLocationXPath) %>% 
  html_text()

# Strip line breaks and surrounding whitespace, keeping the cleaned value
personLocation <- personLocation %>% 
  gsub("[\r\n]", "", .) %>% 
  str_trim(.)

# Here is where I have problems

personExperienceTotalXPath = '//*[@id="experience-section"]/ul'
personExperienceTotal <- signals %>% 
  html_nodes(xpath = personExperienceTotalXPath) %>% 
  html_text()

The last step, personExperienceTotal, is where things go wrong: I cannot seem to scrape the experience-section. When I use my own LinkedIn URL (or some random person's), it seems to work...
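One way to narrow this down is to check whether the element with id experience-section is present in the captured page source at all, which separates "the section was not in the HTML" from "the XPath below it is wrong". The following is a minimal sketch on toy HTML; the helper name section_items and the toy markup are illustrative assumptions, not LinkedIn's actual structure.

```r
library(rvest)
library(xml2)

# Hypothetical helper: returns the cleaned text of every <li> under the
# element with the given id, or NULL (with a message) if that element
# is absent from the HTML that was captured.
section_items <- function(html, section_id) {
  doc <- read_html(html)
  section <- html_nodes(doc, xpath = sprintf('//*[@id="%s"]', section_id))
  if (length(section) == 0) {
    message("No element with id '", section_id, "' in the page source")
    return(NULL)
  }
  items <- html_nodes(section, xpath = ".//li")
  # Collapse runs of whitespace and trim the ends
  trimws(gsub("\\s+", " ", html_text(items)))
}

# Toy page standing in for remDr$getPageSource()[[1]]
toy <- '<html><body><section id="experience-section"><ul>
  <li> Job A
       Company X </li>
  <li>Job B</li></ul></section></body></html>'

experience <- section_items(toy, "experience-section")
education  <- section_items(toy, "education-section")  # absent in the toy page
```

If the helper returns NULL for a real profile, the section was probably not in the DOM when the source was captured, so changing the XPath alone would not help.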

My question is: how can I click the expand experience/education buttons and then scrape these sections?
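A possible approach, sketched here under several assumptions, is to scroll the page, find the expand buttons with findElements, click each one, and only then re-read the page source. The function name expand_and_get_source and the default CSS selector are hypothetical: the selector must be replaced with whatever the live page actually uses, and it should match only the "show more" buttons so that repeated rounds do not toggle a section closed again. The scroll step assumes that some sections are loaded lazily, which the original post does not confirm.

```r
# Sketch only: button_css is an assumed example selector, not a confirmed
# LinkedIn class name. Inspect the live page and substitute the real one.
expand_and_get_source <- function(driver,
                                  button_css = "button.pv-profile-section__see-more-inline",
                                  max_rounds = 5, pause = 1) {
  # Scroll to the bottom so lazily loaded sections (an assumption about
  # the page's behavior) have a chance to render.
  driver$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(pause)

  clicked <- 0L
  for (round in seq_len(max_rounds)) {
    buttons <- driver$findElements(using = "css selector", value = button_css)
    if (length(buttons) == 0) break
    clicked_this_round <- 0L
    for (btn in buttons) {
      # A button can go stale once an earlier click re-renders the page,
      # so a failed click is skipped rather than treated as fatal.
      ok <- tryCatch({ btn$clickElement(); TRUE }, error = function(e) FALSE)
      if (ok) clicked_this_round <- clicked_this_round + 1L
    }
    if (clicked_this_round == 0L) break
    clicked <- clicked + clicked_this_round
    Sys.sleep(pause)
  }
  # Re-read the source AFTER clicking, so the expanded items are included
  list(clicked = clicked, html = driver$getPageSource()[[1]])
}

# Usage with the session from the question (runs only if remDr exists)
if (exists("remDr")) {
  res <- expand_and_get_source(remDr)
  signals <- xml2::read_html(res$html)
  personExperienceTotal <- rvest::html_text(
    rvest::html_nodes(signals, xpath = '//*[@id="experience-section"]//li'))
}
```

The key design point is that the page source captured before the clicks does not contain the expanded entries, so getPageSource has to be called again afterwards.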

0 answers:

No answers