Ignoring some punctuation at the end of a URL string in R regex

Time: 2015-10-31 00:56:07

Tags: regex r string text-extraction

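Is it possible to use a regular expression that ignores certain punctuation marks (other than "/") at the end of a URL string, that is, punctuation at the end of the URL that is followed by a space? When I extract URLs, I get periods, parentheses, question marks, and exclamation marks at the end of the extracted strings, for example: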

findURL <- function(x){
  m <- gregexpr("http[^[:space:]]+", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}

x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!" 

findURL(x)

[1] http://bit.ly/SS/VUEr). http://bit.ly/14pwinr)? http://bit.ly/108vJOM!

findURL2 <- function(x){
  m <- gregexpr("www[^[:space:]]+", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}


y <-  "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"  

findURL2(y)     

[1] www.example.com/store/locator. www.example.com/Google/Voice. www.example.com/network.  

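Is there a way to modify these functions so that a . ) ? ! or , or (if possible) a ). )? )! or ), found at the end of the URL string and followed by a space is not extracted (that is, a period, parenthesis, question mark, exclamation mark, or comma at the end of a URL string, followed by a space)?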

1 Answer:

Answer 0: (score: 2)

Using a positive lookahead, you can also combine the two functions into one...

findURL <- function(x){
  m <- gregexpr("\\b(?:www|http)[^[:space:]]+?(?=[^\\s\\w]*(?:\\s|$))", x, perl=TRUE)
  w <- unlist(regmatches(x,m))
  op <- paste(w,collapse=" ")
  return(op)
}

x <- "find out more at http://bit.ly/SS/VUEr). check it out here http://bit.ly/14pwinr)? http://bit.ly/108vJOM! Now!" 
y <-  "This is an www.example.com/store/locator. of the type of www.example.com/Google/Voice. data I'd like to extract www.example.com/network. get it?"

findURL(x)
findURL(y)

# [1] "http://bit.ly/SS/VUEr http://bit.ly/14pwinr http://bit.ly/108vJOM"

# [1] "www.example.com/store/locator www.example.com/Google/Voice www.example.com/network"
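
To make the lookahead easier to follow, here is a minimal sketch that splits the same pattern into its three parts and returns one URL per element instead of a single pasted string. The name extract_urls and the test sentence are hypothetical and not part of the original answer; the pattern itself is unchanged.

```r
# Hypothetical helper (not from the original answer): same regex,
# broken into commented pieces, returning a character vector.
extract_urls <- function(x) {
  pattern <- paste0(
    "\\b(?:www|http)",         # URL starts at a word boundary with "www" or "http"
    "[^[:space:]]+?",          # lazily consume non-space characters
    "(?=[^\\s\\w]*(?:\\s|$))"  # stop as soon as only punctuation remains
                               # before whitespace or the end of the string
  )
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

urls <- extract_urls("see http://bit.ly/14pwinr)? and www.example.com/network. ok")
print(urls)
# [1] "http://bit.ly/14pwinr"   "www.example.com/network"
```

Because the quantifier is lazy, the match ends at the first position where everything up to the next space is non-word punctuation. One side effect worth knowing: "/" is also a non-word character, so a trailing slash (as in "http://example.com/ ") is dropped as well, which may or may not be what you want.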