Filtering URLs with grep

Asked: 2018-01-25 21:50:02

Tags: r regex grep

Sample data frame:

id url                                              ...                                           
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/title3/something
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/title3/something
...

I use grep to select the rows whose URL contains a given title. The idea is to label certain URLs. I run this several times, once per title.

clickstream[grep('.+/TITLE-IM-LOOKING-FOR/.+', clickstream$url, value = FALSE, perl = TRUE), ]$label <- "ChoosenLabel"

Is there a better way to filter and label URLs? Is grep always the best choice?

Output:

id url                                                                 Label                                          
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel

Update: I found that removing the .+ from the pattern speeds things up like crazy.
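A plausible explanation for that speedup, sketched below: grep and grepl already report a match anywhere in the string, so the leading and trailing .+ only give the regex engine extra work. The `urls` vector is a hypothetical copy of the sample column, and `fixed = TRUE` (a literal substring search) is an additional option not mentioned in the question.

```r
# Hypothetical copy of the sample URLs from the question
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/dance/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/",
  "www.hello.com/art/article/title4/nothing",
  "www.hello.com/art/dance/article/title2/nothing",
  "www.hello.com/art/dance/article/title3/something"
)

# Question's style: .+ on both sides of the fragment
padded <- grepl(".+/title3/.+", urls, perl = TRUE)

# Same fragment without the padding; grepl matches anywhere in the string
bare <- grepl("/title3/", urls)

# Literal substring search, no regex interpretation at all
literal <- grepl("/title3/", urls, fixed = TRUE)

which(padded)
which(bare)
which(literal)
```

On this sample all three select the same rows. They can differ for a URL that begins or ends exactly at the fragment, for example one ending in /title3/, which the padded pattern rejects because .+ needs at least one character on each side.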

3 answers:

Answer 0: (score: 2)

In base R, this can be done with:

transform(dat,Label=ifelse(grepl("title3",url),"title3",""))

  id                                              url  Label
1  1   www.hello.com/art/dance/article/title1/nothing       
2  2                      www.hello.com/dance/nothing       
3  3   www.hello.com/art/dance/article/title2/nothing       
4  4 www.hello.com/art/dance/article/title3/something title3
5  5                         www.hello.com/art/dance/       
6  6         www.hello.com/art/article/title4/nothing       
7  7   www.hello.com/art/dance/article/title2/nothing       
8  8 www.hello.com/art/dance/article/title3/something title3
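Since the question mentions labelling more than one title, the same grepl idea can be repeated over a lookup vector. This is only a sketch: the data frame `dat`, the `title_labels` vector, and the label values `LabelA` and `LabelB` are hypothetical names chosen for illustration.

```r
# Hypothetical sample data mirroring a few rows from the question
dat <- data.frame(
  id = 1:4,
  url = c(
    "www.hello.com/art/dance/article/title1/nothing",
    "www.hello.com/dance/nothing",
    "www.hello.com/art/dance/article/title3/something",
    "www.hello.com/art/article/title4/nothing"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: names are URL fragments, values are the labels to assign
title_labels <- c("/title3/" = "LabelA", "/title4/" = "LabelB")

# Start with an empty label, then fill in one fragment at a time
dat$Label <- ""
for (fragment in names(title_labels)) {
  hit <- grepl(fragment, dat$url, fixed = TRUE)  # literal substring match
  dat$Label[hit] <- title_labels[[fragment]]
}

dat
```

If a URL contains more than one fragment, the last matching entry in `title_labels` wins, so the order of the lookup vector matters.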

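Answer 1: (score: 1)

One option is to call grep with value = TRUE, which returns the matching strings themselves. Suppose you are looking for "dance"; then try: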

> grep(".+/dance/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title1/nothing"  
[2] "www.hello.com/dance/nothing"                     
[3] "www.hello.com/art/dance/article/title2/nothing"  
[4] "www.hello.com/art/dance/article/title3/something"
[5] "www.hello.com/art/dance/article/title2/nothing"  
[6] "www.hello.com/art/dance/article/title3/something"

Another example might be:

> grep(".+/title3/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title3/something"
[2] "www.hello.com/art/dance/article/title3/something"
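For labelling rather than just listing matches, it may help to see how the three related calls differ. The sketch below uses a hypothetical three-element `urls` vector; grep returns positions, grep with value = TRUE returns the strings, and grepl returns one logical per element.

```r
# Hypothetical subset of the sample URLs
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/"
)

idx  <- grep("/title3/", urls)                # integer positions of the matches
vals <- grep("/title3/", urls, value = TRUE)  # the matching strings themselves
hit  <- grepl("/title3/", urls)               # TRUE/FALSE for every element

# All three describe the same match
identical(urls[idx], vals)
identical(which(hit), idx)
```

The logical vector from grepl always has the same length as the input, which makes it convenient for ifelse and for assigning into a column.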

Answer 2: (score: 1)

Option 1: using dplyr:

# Create data
clickstream <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id url                                                                           
1  www.hello.com/art/dance/article/title1/nothing
2  www.hello.com/dance/nothing
3  www.hello.com/art/dance/article/title2/nothing
4  www.hello.com/art/dance/article/title3/something
5  www.hello.com/art/dance/
6  www.hello.com/art/article/title4/nothing
7  www.hello.com/art/dance/article/title2/nothing
8  www.hello.com/art/dance/article/title3/something")

# Pattern to match and its replacement
# (no leading or trailing .+, so only the title segment is replaced)
regex <- "/title3/"
replacement <- "/TITLE-IM-LOOKING-FOR/"

# Label the matching rows, then rewrite the URL only where a label was set
library(dplyr)
clickstream %>%
  mutate(label = if_else(grepl(regex, url), "ChoosenLabel", "")) %>%
  mutate(url = if_else(label != "", gsub(regex, replacement, url), url))

Output:

  id                                                            url        label
1  1                 www.hello.com/art/dance/article/title1/nothing
2  2                                    www.hello.com/dance/nothing
3  3                 www.hello.com/art/dance/article/title2/nothing
4  4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5  5                                       www.hello.com/art/dance/
6  6                       www.hello.com/art/article/title4/nothing
7  7                 www.hello.com/art/dance/article/title2/nothing
8  8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel

Option 2: using data.table (same output):

library(data.table)
# Copy into a data.table so the original data frame is left untouched
dt <- as.data.table(clickstream)
# Base ifelse is used here so this option does not depend on dplyr
dt[, label := ifelse(grepl(regex, url), "ChoosenLabel", "")]
dt[label != "", url := gsub(regex, replacement, url)]