Remove all punctuation except backslashes in R

Time: 2014-11-26 17:24:11

Tags: regex r grep strsplit

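I am trying to extract HTML links from a dataset. I use strsplit and then grep to find the substrings that contain links, but the results have unwanted characters at the beginning or end of the string. How can I extract only the part of the string that matches the desired pattern, or otherwise keep only that pattern?

Here is what I am currently doing.

1) I use strsplit with " " (a space) as the delimiter to split a large chunk of text.

2) Next, I grep the strsplit result to find the pattern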

e.g. grep("https:\\/\\/support.google.com\\/blogger\\/topic\\/[0-9]", r)

3) The results come in a few variations, as shown below:

https://support.google.com/blogger/topic/12457 
https://support.google.com/blogger/topic/12457.
[https://support.google.com/blogger/topic/12457]  
<<https://support.google.com/blogger/topic/12457>>
https://support.google.com/blogger/topic/12457,
https://support.google.com/blogger/topic/12457),
xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta
etc...
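
The split-then-grep workflow in steps 1 and 2 can be sketched roughly as follows. The input string `txt` is made up for illustration (the real dataset is not shown), and the dots in the pattern are escaped here so that they match literally. Note that `grep(..., value = TRUE)` returns whole tokens, which is why the surrounding characters come along.

```r
# Hypothetical input text; the asker's actual data is not shown.
txt <- paste(
  "See [https://support.google.com/blogger/topic/12457] or",
  "xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta for details."
)

# Step 1: split the text on single spaces.
tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Step 2: keep every token that contains the link pattern.
# value = TRUE returns the matching tokens rather than their positions.
pattern <- "https://support\\.google\\.com/blogger/topic/[0-9]+"
dirty <- grep(pattern, tokens, value = TRUE)

# The tokens still carry the brackets and stray text around each link.
print(dirty)
```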

How can I extract just "https://support.google.com/blogger/topic/12457", or, after extracting the dirty data, how can I remove the unwanted punctuation?

Thanks in advance.

3 Answers:

Answer 0: (score: 1)

The qdapRegex package has a great function, rm_url, that is well suited to this example.

install.packages('qdapRegex')
library(qdapRegex)

urls <- YOUR_VECTOR_OF_URLS
rm_url(urls, extract = TRUE)
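
If adding a package is not an option, base R's regmatches() combined with gregexpr() can return only the matched part of each string. This is a minimal sketch rather than a drop-in replacement for rm_url: it assumes every link follows the exact pattern from the question, and the `urls` vector below is made-up sample data.

```r
# Hypothetical sample data modelled on the variations listed in the question.
urls <- c(
  "https://support.google.com/blogger/topic/12457.",
  "<<https://support.google.com/blogger/topic/12457>>",
  "xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta",
  "no link here"
)

# Literal dots are escaped; [0-9]+ takes the whole run of digits.
pattern <- "https://support\\.google\\.com/blogger/topic/[0-9]+"

# gregexpr() locates every match in each element; regmatches() pulls them out
# as a list with one character vector per input element.
matches <- regmatches(urls, gregexpr(pattern, urls))

# Flatten to a plain character vector; elements without a match drop out.
clean <- unlist(matches)
print(clean)
```

Because only the matched text is returned, anything before or after the link is discarded without a separate cleanup step.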

Answer 1: (score: 0)

If the data was HTML at some point, you could try this:

library(XML)
urls <- getNodeSet(htmlParse(htmldata), "//a[contains(@href, 'support.google.com')]/@href")

Answer 2: (score: 0)

Using rex may make this kind of task simpler.

# generate dataset
x <- c(
"https://support.google.com/blogger/topic/12457
https://support.google.com/blogger/topic/12457.
https://support.google.com/blogger/topic/12457] 
<<https://support.google.com/blogger/topic/12457>>
https://support.google.com/blogger/topic/12457,
https://support.google.com/blogger/topic/12457),
xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta")

# extract urls
# note you don't have to worry about escaping the html string yourself
library(rex)    
re <- rex(
  capture(name = "url",
    "https://support.google.com/blogger/topic/",
    digits
    ))

re_matches(x, re, global = TRUE)[[1]]
#>                                             url
#>1 https://support.google.com/blogger/topic/12457
#>2 https://support.google.com/blogger/topic/12457
#>3 https://support.google.com/blogger/topic/12457
#>4 https://support.google.com/blogger/topic/12457
#>5 https://support.google.com/blogger/topic/12457
#>6 https://support.google.com/blogger/topic/12457
#>7 https://support.google.com/blogger/topic/12457
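
For the other option raised in the question, cleaning up tokens that have already been extracted, a rough base R sketch is to strip punctuation only at the start and end of each string with gsub(). The `dirty` vector is made-up sample data. This approach is limited: it leaves the slashes, colon and dots inside the link alone, but it cannot remove stray letters attached to the link (as in the "xxxxxx...hhhththta" case), so matching the link pattern directly is usually the more robust route.

```r
# Hypothetical tokens with punctuation stuck to either end of the link.
dirty <- c(
  "https://support.google.com/blogger/topic/12457.",
  "[https://support.google.com/blogger/topic/12457]",
  "https://support.google.com/blogger/topic/12457),"
)

# ^[[:punct:]]+ removes a run of punctuation at the start of the string,
# [[:punct:]]+$ removes a run at the end; punctuation in the middle is kept.
clean <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", dirty)
print(clean)
```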