Sample data frame:
id url ...
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/title3/something
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/title3/something
...
I use grep to filter the rows whose URL contains a given title. The idea is to label certain URLs. I run this for multiple titles.
df[grep('.+/TITLE-IM-LOOKING-FOR/.+', clickstream$url, value = FALSE,perl=TRUE),]$label <- "ChoosenLabel"
Is there a better way to filter and label URLs? Is grep always the best option?
Output
id url Label
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
Update: I found that removing the .+ from the pattern speeds things up dramatically.
Answer 0 (score: 2)
To do this in base R:
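The update above can be checked with a small sketch. It assumes the title always appears as a full path segment, and the `urls` vector is a hypothetical subset of the sample data. Without the `.+` wrappers the pattern is a plain string, so `fixed = TRUE` can be used, which avoids the regex engine entirely.

```r
# Hypothetical subset of the sample URLs
urls <- c(
  "www.hello.com/art/dance/article/title1/nothing",
  "www.hello.com/dance/nothing",
  "www.hello.com/art/dance/article/title3/something",
  "www.hello.com/art/dance/"
)

# Original style: wildcards on both sides of the title
with_wildcards <- grepl(".+/title3/.+", urls, perl = TRUE)

# Simplified: literal substring match, no regex interpretation
literal <- grepl("/title3/", urls, fixed = TRUE)

# Both approaches select the same rows for this data
identical(with_wildcards, literal)

# Caveat: the two differ when the URL ends right after the title segment
edge <- "www.hello.com/art/title3/"
edge_wildcards <- grepl(".+/title3/.+", edge, perl = TRUE)  # FALSE: nothing follows the last slash
edge_literal <- grepl("/title3/", edge, fixed = TRUE)       # TRUE
```

If URLs that end with the title segment should not be labeled, the wildcard version (or an explicit extra check) is still needed.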
transform(dat,Label=ifelse(grepl("title3",url),"title3",""))
id url Label
1 1 www.hello.com/art/dance/article/title1/nothing
2 2 www.hello.com/dance/nothing
3 3 www.hello.com/art/dance/article/title2/nothing
4 4 www.hello.com/art/dance/article/title3/something title3
5 5 www.hello.com/art/dance/
6 6 www.hello.com/art/article/title4/nothing
7 7 www.hello.com/art/dance/article/title2/nothing
8 8 www.hello.com/art/dance/article/title3/something title3
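Since the question mentions running this for several titles, the same base R idea could be extended roughly as follows. This is only a sketch: the `labels` lookup vector and its label names are hypothetical, and it assumes each URL contains at most one of the titles.

```r
# Small stand-in for the sample data
dat <- data.frame(
  id  = 1:4,
  url = c(
    "www.hello.com/art/dance/article/title1/nothing",
    "www.hello.com/dance/nothing",
    "www.hello.com/art/dance/article/title3/something",
    "www.hello.com/art/dance/"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: title segment -> label to assign
labels <- c(title1 = "LabelA", title3 = "LabelB")

dat$Label <- ""
for (title in names(labels)) {
  # Literal match on the full path segment
  hit <- grepl(paste0("/", title, "/"), dat$url, fixed = TRUE)
  dat$Label[hit] <- labels[[title]]
}
```

If a URL could contain more than one title, the last matching entry in `labels` wins, so the order of the lookup vector would matter.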
Answer 1 (score: 1)
One option is to use grep with value = TRUE to return the matching values. Suppose you are looking for "dance"; then try:
> grep(".+/dance/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title1/nothing"
[2] "www.hello.com/dance/nothing"
[3] "www.hello.com/art/dance/article/title2/nothing"
[4] "www.hello.com/art/dance/article/title3/something"
[5] "www.hello.com/art/dance/article/title2/nothing"
[6] "www.hello.com/art/dance/article/title3/something"
Another example might be:
> grep(".+/title3/.+", df$url, value = TRUE)
[1] "www.hello.com/art/dance/article/title3/something"
[2] "www.hello.com/art/dance/article/title3/something"
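For labeling rather than just listing matches, it may help to compare the three forms the grep family can return. The sketch below uses a hypothetical three-row `df`; the label string is the one from the question.

```r
df <- data.frame(
  id  = 1:3,
  url = c(
    "www.hello.com/art/dance/article/title1/nothing",
    "www.hello.com/art/dance/article/title3/something",
    "www.hello.com/art/dance/"
  ),
  stringsAsFactors = FALSE
)

matched_values <- grep("/title3/", df$url, value = TRUE)  # the matching strings
matched_rows   <- grep("/title3/", df$url)                # integer row positions
is_match       <- grepl("/title3/", df$url)               # one TRUE/FALSE per row

# A logical vector can be used directly to assign the label
df$label <- ""
df$label[is_match] <- "ChoosenLabel"
```

The logical form from `grepl` tends to be the most convenient for assignment, because it always has one element per row, even when nothing matches.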
Answer 2 (score: 1)
Option 1, using dplyr:
# Create data
clickstream <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id url
1 www.hello.com/art/dance/article/title1/nothing
2 www.hello.com/dance/nothing
3 www.hello.com/art/dance/article/title2/nothing
4 www.hello.com/art/dance/article/title3/something
5 www.hello.com/art/dance/
6 www.hello.com/art/article/title4/nothing
7 www.hello.com/art/dance/article/title2/nothing
8 www.hello.com/art/dance/article/title3/something")
# Your pattern: match only the title segment, so the surrounding characters are left intact
regex <- "/title3/"
replacement <- "/TITLE-IM-LOOKING-FOR/"
# Computation: label the matching rows, then rewrite their URLs
library(dplyr)
clickstream %>%
  mutate(label = if_else(grepl(regex, url), "ChoosenLabel", "")) %>%
  mutate(url = if_else(label != "", gsub(regex, replacement, url), url))
Output:
id url label
1 1 www.hello.com/art/dance/article/title1/nothing
2 2 www.hello.com/dance/nothing
3 3 www.hello.com/art/dance/article/title2/nothing
4 4 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
5 5 www.hello.com/art/dance/
6 6 www.hello.com/art/article/title4/nothing
7 7 www.hello.com/art/dance/article/title2/nothing
8 8 www.hello.com/art/dance/article/TITLE-IM-LOOKING-FOR/something ChoosenLabel
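Option 2, using data.table (same output):
library(data.table)
# setDT converts clickstream to a data.table by reference
dt <- setDT(clickstream)
# Base ifelse is used here so that dplyr does not need to be loaded
dt[, label := ifelse(grepl(regex, url), "ChoosenLabel", "")]
dt[label != "", url := gsub(regex, replacement, url)]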