How do I concatenate a string to every element of a list?

Asked: 2021-07-30 01:44:03

Tags: r web-scraping concatenation lapply purrr

How can I prepend the link "https://www.indeed.com" to every element of the list link_2?

I can't get the "https://www.indeed.com" prefix attached to each item in my list. This is what I tried:

paste("https://www.indeed.com", link_2, sep="")

link_2 contains 10 separate lists, each with roughly 10-15 items. My goal is to add "https://www.indeed.com" to the start of every element in each of those lists.

library(tidyverse)
library(rvest)
library(xml2)

url<-"https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&vjk=0c2a6008b4969776"
page <- xml2::read_html(url)  # read the page's HTML and parse it into elements (<div>, <span>, <p>, etc.)


#get job title
title<-page %>%
  html_nodes(".jobTitle") %>%
  html_text()
  
#get company Location
loc<-page %>%
  html_nodes(".companyLocation") %>%
  html_text()

#job snippet
snippet<-page %>%
  html_nodes(".job-snippet") %>%
  html_text()

#Get link 
desc<- page %>%
  html_nodes("a[data-jk]") %>%
  html_attr("href") 

# Build the full link by prepending the site root to each relative href
combined_link <- paste("https://www.indeed.com", desc, sep="")

# Open the first combined link as a session and read its job description



page1 <-  html_session(combined_link[[1]])
page1 %>%
  html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
  html_text()

#one<- page %>% html_elements("a[id*='job']")

# Read every linked job page; returns a list of parsed documents

ret <- lapply(paste0("https://www.indeed.com", desc), xml2::read_html)


description <- purrr::map(ret, ~ .x %>%
             html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
             html_text())

# Combine the columns (cbind returns a matrix; the name avoids masking base::c)
jobs <- cbind(title, loc, snippet, combined_link, description)
View(jobs)

#Turn the page
# https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&vjk=0c2a6008b4969776
#https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&start=10&vjk=14ba77c18c90585f
#https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&start=20&vjk=2f73a3d9cb046e50



# Grab 10 result pages (about 100 results): start = 10, 20, ..., 100
# paste0() has no sep argument, so the offsets are passed as a plain argument
url_2 <- lapply(paste0("https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&start=", seq(10, 100, by = 10)), xml2::read_html)
url_2 # it works!

# Get job titles (one character vector per page)
titles_2 <- purrr::map(url_2, ~ .x %>%
                         html_nodes(".jobTitle") %>%
                         html_text())

# Get locations
loc_2 <- purrr::map(url_2, ~ .x %>%
                      html_nodes(".companyLocation") %>%
                      html_text())

# Get job snippets (named snippet_2 so the single-page snippet is not overwritten)
snippet_2 <- purrr::map(url_2, ~ .x %>%
                          html_nodes(".job-snippet") %>%
                          html_text())

# Get links (relative hrefs, one character vector per page)
link_2 <- purrr::map(url_2, ~ .x %>%
                       html_nodes("a[data-mobtk]") %>%
                       html_attr("href"))

indeed <- "https://www.indeed.com"

1 Answer:

Answer 0 (score: 1)

If you just want all of the links, you can unlist link_2 and paste the prefix onto each one.

indeed <- "https://www.indeed.com"
result <- paste0(indeed, unlist(link_2))
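
To make the flattening step concrete, here is a minimal sketch on made-up data. The name link_2_demo and the hrefs in it are hypothetical stand-ins for the scraped link_2, assuming each page yields a character vector of relative links.

```r
# Hypothetical stand-in for link_2: a list of character vectors,
# one vector of relative hrefs per scraped page (values are made up).
link_2_demo <- list(
  c("/rc/clk?jk=aaa", "/rc/clk?jk=bbb"),
  c("/rc/clk?jk=ccc")
)
indeed <- "https://www.indeed.com"

# unlist() flattens the list into a single character vector;
# paste0() is vectorised, so the prefix is added to every element.
flat_links <- paste0(indeed, unlist(link_2_demo))
print(flat_links)
```

The result is a single character vector, so the information about which page each link came from is lost.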

If you want to keep the list structure of link_2, use lapply:

result <- lapply(link_2, function(x) paste0(indeed, x))
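
The following sketch, again on hypothetical data (link_2_demo is not from the question), shows the likely reason the original paste() call did not behave as expected, and how the lapply version differs. Called on the list itself, paste0() coerces each sub-vector to one string, whereas lapply() applies paste0() inside each sub-vector.

```r
# Hypothetical stand-in for link_2 (values are made up).
link_2_demo <- list(c("/a", "/b"), c("/c"))
indeed <- "https://www.indeed.com"

# Calling paste0() on the list coerces each element with as.character(),
# so a multi-element sub-vector collapses into one deparsed string.
wrong <- paste0(indeed, link_2_demo)
print(wrong)

# lapply() visits each sub-vector and prefixes its elements individually,
# returning a list with the same shape as the input.
nested_links <- lapply(link_2_demo, function(x) paste0(indeed, x))
str(nested_links)
```

Since the question already loads purrr, `purrr::map(link_2, ~ paste0(indeed, .x))` should give the same nested result.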