I want to remove "/url?q=" from text data in RStudio. Here is my Google search code:
## Code for Google Search
library(RCurl)  # provides getURL
library(XML)    # provides htmlTreeParse, getNodeSet, xmlAttrs
# Enter Search Term Here
search.term <- "r-project"
# Creating Function
getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE)
{
  # Getting Search Term
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  # Putting Search Term in Google Search
  paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}
## Get Links from Google Search
# Creating Function to Get URLs From Search Results
getGoogleLinks <- function(google.url) {
  # Downloading the Search Results Page
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
  # Parsing the HTML and Selecting the Result Link Nodes
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  sapply(nodes, function(x) xmlAttrs(x)[["href"]])
}
## Search Without Quoting the Term, Create URL List
quotes <- FALSE
search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
links <- getGoogleLinks(search.url)
## Print URL List
links
My result is:
[1] "/url?q=https://www.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQFggUMAA&usg=AFQjCNEqtiOAIA7OOTa3meWC8zaTjjTy8A"
[2] "/url?q=http://www.cran.r-project.org/&sa=U&ved=0ahUKEwj78ZWXoabUAhUcTI8KHaTEDTIQjBAIGzAB&usg=AFQjCNF8QmYbLzG0c66QZM2wsXF1n1-9tQ"
How can I remove "/url?q=" from the links above?
Answer 0 (score: 1)
You can use gsub.
## Code for Google Search
library(RCurl)  # provides getURL
library(XML)    # provides htmlTreeParse, getNodeSet, xmlAttrs
# Enter Search Term Here
search.term <- "r-project"
# Creating Function
getGoogleURL <- function(search.term, domain = '.co.in', quotes = TRUE)
{
  # Getting Search Term
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  # Putting Search Term in Google Search
  paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}
## Get Links from Google Search
# Creating Function to Get URLs From Search Results
getGoogleLinks <- function(google.url) {
  # Downloading the Search Results Page
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(3.4.0)"))
  # Parsing the HTML and Selecting the Result Link Nodes
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  sapply(nodes, function(x) xmlAttrs(x)[["href"]])
}
## Search Without Quoting the Term, Create URL List
quotes <- FALSE
search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
links <- getGoogleLinks(search.url)
## Print URL List
# fixed = TRUE matches "/url?q=" literally; as a regex, "?" would be a quantifier
gsub("/url?q=", "", links, fixed = TRUE)
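One detail worth checking with gsub is that its pattern is a regular expression by default, and "?" is a quantifier there, so an unescaped "/url?q=" does not match the literal prefix. The sketch below compares the default call with two literal-matching alternatives. The sample links only imitate the shape of the output in the question; their tracking values (abc, xyz, and so on) are shortened placeholders, not real data.

```r
# Sample hrefs shaped like the question's output (tracking values are placeholders)
links <- c(
  "/url?q=https://www.r-project.org/&sa=U&ved=abc&usg=xyz",
  "/url?q=http://www.cran.r-project.org/&sa=U&ved=def&usg=uvw"
)

# Default regex mode: "l?" means "an optional l", so the literal "?" never matches
unchanged <- gsub("/url?q=", "", links)

# Option 1: treat the pattern as a plain string
clean_fixed <- gsub("/url?q=", "", links, fixed = TRUE)

# Option 2: keep regex semantics, escape the "?" and anchor to the start
clean_regex <- sub("^/url\\?q=", "", links)

print(clean_fixed)
```

Anchoring with "^" in the second option only removes the prefix at the start of each string, which may be safer if the same text could appear later inside a URL.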
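Answer 1 (score: 1)
I solved it this way, since the prefix always has the same number of characters ("/url?q=" is 7 characters long, so the URL starts at character 8):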
links <- substring(links,8)
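The fixed-offset approach assumes every element really starts with "/url?q=". A slightly more defensive sketch, under that same assumption about the prefix, derives the offset from the prefix itself and leaves other strings untouched. The sample links, including the third one without a prefix, are hypothetical placeholders.

```r
# Placeholder links; the third one has no prefix, to show the guard
links <- c(
  "/url?q=https://www.r-project.org/&sa=U&ved=abc&usg=xyz",
  "/url?q=http://www.cran.r-project.org/&sa=U&ved=def&usg=uvw",
  "https://example.org/"
)

prefix <- "/url?q="
offset <- nchar(prefix) + 1  # first character after the prefix

# Only strip the strings that actually begin with the prefix
has_prefix <- startsWith(links, prefix)
clean <- ifelse(has_prefix, substring(links, offset), links)

print(clean)
```

startsWith is part of base R from version 3.3.0 onward; on older versions, substr(links, 1, nchar(prefix)) == prefix gives the same check.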
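Answer 2 (score: 0)
In addition to @JTeam's answer, you can try this (given that the links always start with /url?q=). The pieces are rejoined with '=' so that the '=' signs inside the rest of the URL are preserved:
lapply(links, function(x) paste0(strsplit(x, '=')[[1]][-1], collapse = '='))
This gives you a clean list of links (if you prefer a vector, try sapply).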
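As a sketch of the same split-and-rejoin idea returning a character vector, the version below uses vapply and a small helper. The helper name strip_prefix and the sample links are assumptions made for illustration. Note that rejoining with "=" rather than an empty string keeps the "=" signs in the remaining query parameters intact.

```r
# Placeholder links shaped like the question's output
links <- c(
  "/url?q=https://www.r-project.org/&sa=U&ved=abc&usg=xyz",
  "/url?q=http://www.cran.r-project.org/&sa=U&ved=def&usg=uvw"
)

# Hypothetical helper: drop everything up to the first "=" and rejoin the rest
strip_prefix <- function(x) {
  parts <- strsplit(x, "=", fixed = TRUE)[[1]]
  paste0(parts[-1], collapse = "=")
}

# vapply returns a plain character vector and checks each result is one string
clean <- vapply(links, strip_prefix, character(1), USE.NAMES = FALSE)

print(clean)
```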