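I want to remove everything after the first ? character in a URL. Three of the six rows in my sample data contain a ? character; the other three are fine as they are.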
structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/",
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http",
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/",
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http",
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL",
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")
I tried:
df1$URL<-sub("?:.*$","",df1$URL)
This appears to have no effect.
I also tried:
df1$URL<-sapply(str_split(df1$URL,"?"),"[",1)
This produces an error message.
A third attempt:
df1$URL<-sapply(strsplit(df1$URL,"?"),"[",1)
This removed everything except the forward slashes from my URL field.
Answer 0 (score: 2)
You can, and should, use URL-specific tools to process URLs. The urltools package has ready-made functions for this:
library(urltools)
dat <- structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/",
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http",
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/",
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http",
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL",
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")
url_parse(dat$URL)$path
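Answer 1 (score: 1)
You need to escape the ?, because ? is a special metacharacter in regular expressions (a quantifier), so an unescaped ? is not matched as a literal question mark.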
df1$URL <- sub("\\?.*","",df1$URL)
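For comparison, here is a minimal base R sketch of three equivalent ways to treat ? as a literal character. The vector name `urls` and the result names are hypothetical, chosen only for illustration; the two URLs are taken from the question's sample data.

```r
# Two sample URLs from the question: one without and one with a query string
urls <- c(
  "/2015/09/25/animated-dune-matt-rhodes-concept-art/",
  "/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
)

# Option 1: escape the ? with a double backslash so the regex sees a literal "?"
cleaned_escape <- sub("\\?.*", "", urls)

# Option 2: put the ? in a character class, where it loses its special meaning
cleaned_class <- sub("[?].*", "", urls)

# Option 3: split on a literal "?" (fixed = TRUE disables regex interpretation)
# and keep the first piece of each URL
cleaned_split <- vapply(strsplit(urls, "?", fixed = TRUE), `[`, character(1), 1)

print(cleaned_escape)
```

All three variants leave URLs without a ? unchanged. The `fixed = TRUE` form may be the easiest to read when no pattern matching beyond the literal character is needed.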