How to remove "/url?q=" from text data in R

Time: 2017-06-05 08:38:32

Tags: r url

I want to remove "/url?q=" from text data in RStudio. Here is my Google search code:

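## Code for Google Search
library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), getNodeSet(), xmlAttrs()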
# Enter Search Term Here
 search.term <- "r-project"
# Creating Function
 getGoogleURL <- function(search.term, domain = '.co.in', quotes=TRUE) 
{
   # Getting Search Term
    search.term <- gsub(' ', '%20', search.term)
    if(quotes) search.term <- paste('%22', search.term, '%22', sep='') 
   # Putting Search Term in Google Search
    getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep='') }

## Get Links from Google Search
# Creating Function to Get URLs From Search Results
 getGoogleLinks <- function(google.url) {
  # Creating a File to Save URLs
   doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
  # Removing HTML code and Setting Nodes
   html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
   nodes <- getNodeSet(html, "//h3[@class='r']//a")
   return(sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])) }

## Remove quoted text, Create URL List
  quotes <- FALSE
  search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
  links <- getGoogleLinks(search.url)

## Print URL List
  links

My result is:

  

[1] "/url?q=https://www.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQFggUMAA&usg=AFQjCNEqtiOAIA7OOTa3meWC8zaTjjTy8A"
[2] "/url?q=http://www.cran.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQjBAIGzAB&usg=AFQjCNF8QmYbLzG0c66QZM2wsXF1n1-9tQ"

How can I remove "/url?q=" from the links above?

3 answers:

Answer 0 (score: 1)

You can use gsub.

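## Code for Google Search
library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse(), getNodeSet(), xmlAttrs()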
# Enter Search Term Here
 search.term <- "r-project"
# Creating Function
 getGoogleURL <- function(search.term, domain = '.co.in', quotes=TRUE) 
{
   # Getting Search Term
    search.term <- gsub(' ', '%20', search.term)
    if(quotes) search.term <- paste('%22', search.term, '%22', sep='') 
   # Putting Search Term in Google Search
    getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep='') }

## Get Links from Google Search
# Creating Function to Get URLs From Search Results
 getGoogleLinks <- function(google.url) {
  # Creating a File to Save URLs
   doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
  # Removing HTML code and Setting Nodes
   html <- htmlTreeParse(doc, useInternalNodes = TRUE, error=function(...){})
   nodes <- getNodeSet(html, "//h3[@class='r']//a")
   return(sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])) }

## Remove quoted text, Create URL List
  quotes <- FALSE
  search.url <- getGoogleURL(search.term=search.term, quotes=quotes)
  links <- getGoogleLinks(search.url)

## Print URL List
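# fixed = TRUE matches "/url?q=" literally; as a regex, "?" would act as a quantifier
  gsub("/url?q=", "", links, fixed = TRUE)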
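
One caveat with this approach: gsub treats its pattern as a regular expression by default, and "?" is a quantifier there, so an unescaped "/url?q=" does not match the literal prefix. A minimal sketch of the difference, using two sample strings shortened from the output in the question (this `links` vector is a stand-in, not a live search result):

```r
# Sample input shortened from the question's output; a stand-in for the
# real result of getGoogleLinks().
links <- c("/url?q=https://www.r-project.org/&sa=U",
           "/url?q=http://www.cran.r-project.org/&sa=U")

# As a regex, "/url?q=" means "/ur", an optional "l", then "q=",
# so it never matches the literal "?" and leaves the strings unchanged.
unescaped <- gsub("/url?q=", "", links)

# Either treat the pattern as a fixed string ...
fixed <- gsub("/url?q=", "", links, fixed = TRUE)

# ... or escape the "?" and anchor the match at the start of the string.
escaped <- sub("^/url\\?q=", "", links)

print(fixed)
```

The anchored `sub` variant only touches a prefix at the very start of each string, which may be preferable if "/url?q=" could also occur later in a link.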

Answer 1 (score: 1)

I solved this, since the prefix always has the same number of characters:

links <- substring(links,8)
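
The offset 8 works because the prefix "/url?q=" is 7 characters long. A slightly more defensive sketch, assuming some entries might lack the prefix (the second sample entry below is made up for illustration), derives the offset from the prefix and strips it only where it is present:

```r
# Sample input: the first entry is shortened from the question's output,
# the second is a hypothetical entry without the prefix.
links <- c("/url?q=https://www.r-project.org/&sa=U",
           "https://www.r-project.org/")

prefix <- "/url?q="

# startsWith() compares literally, so the "?" needs no escaping.
has_prefix <- startsWith(links, prefix)

# Drop the prefix only where it is present; nchar(prefix) + 1 is 8 here.
cleaned <- ifelse(has_prefix, substring(links, nchar(prefix) + 1), links)

print(cleaned)
```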

Answer 2 (score: 0)

In addition to @JTeam's answer, you can try this (given that the links always start with /url?q=):

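# Split on "=", drop the leading "/url?q" piece, and rejoin with "=" so the
# "=" signs inside the remaining query string are preserved
lapply(links, function(x) paste0(strsplit(x, '=')[[1]][-1], collapse = '='))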

This gives you a clean list of links (if you prefer a vector, try sapply instead of lapply).
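
If the goal is the target address alone, without the trailing "&sa=...&ved=...&usg=..." tracking parameters, one possible sketch is to capture everything between the prefix and the first "&". This assumes any "&" inside the target URL is percent-encoded in these redirect links; the `ved` and `usg` values below are placeholders, not real data:

```r
# Sample input modeled on the question's output; the ved/usg values are
# placeholders.
links <- c("/url?q=https://www.r-project.org/&sa=U&ved=abc&usg=xyz",
           "/url?q=http://www.cran.r-project.org/&sa=U&ved=def&usg=uvw")

# Capture everything after "/url?q=" up to (not including) the first "&".
targets <- sub("^/url\\?q=([^&]*).*$", "\\1", links)

# Percent-decode each target one at a time; vapply returns a plain
# character vector.
targets <- vapply(targets, utils::URLdecode, character(1), USE.NAMES = FALSE)

print(targets)
```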