Question

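I want to remove everything after the first ? character in a URL. In my sample data, 3 of the 6 rows contain a ? character; the other 3 are fine as they are.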

structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")

I tried:

df1$URL<-sub("?:.*$","",df1$URL)

This appears to have no effect.

I also tried:

df1$URL<-sapply(str_split(df1$URL,"?"),"[",1)

which produces an error message.

My third attempt:

df1$URL<-sapply(strsplit(df1$URL,"?"),"[",1)

removed everything from my URL field except the leading forward slash.

Answer 1

You can, and should, use URL-specific tools to work with URLs. The urltools package has some ready-made functionality for this:

library(urltools)

dat <- structure(list(URL = c("/2015/08/10/five-great-fantasy-books-most-fans-dont-know-exist/", 
"/2015/09/25/animated-dune-matt-rhodes-concept-art/", "/2015/09/09/the-dogs-of-athens-kendare-blake/?et_cid=34295599&et_rid=1476556397&linkid=http", 
"/2015/06/16/spin-the-wheel-1-the-wheel-of-time-companion/comment-page-4/", 
"/2015/06/29/excerpt-brandon-sanderson-shadows-of-self-prologue/?et_cid=34326143&et_rid=1724499137&linkid=http", 
"/2015/08/12/milagroso-isabel-yap/?et_cid=34174778&et_rid=559408553&linkid=http"
), Pageviews = c(100L, 200L, 113L, 100L, 50L, 13L)), .Names = c("URL", 
"Pageviews"), row.names = c(NA, -6L), class = "data.frame")


url_parse(dat$URL)$path

Answer 2

You need to escape the ?, because ? is a special metacharacter in regular expressions.

df1$URL <- sub("\\?.*","",df1$URL)
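To make the escaping point concrete, here is a minimal base-R sketch comparing three ways of treating ? as a literal character. The `urls` vector is a shortened, hypothetical stand-in for the `URL` column, not the asker's data, and the variable names are only illustrative.

```r
# Hypothetical sample vector; only the second element has a query string.
urls <- c("/2015/09/25/animated-dune/",
          "/2015/09/09/the-dogs-of-athens/?et_cid=1&linkid=http")

# 1. Escape ? so the regex engine treats it as a literal character.
escaped <- sub("\\?.*", "", urls)

# 2. A character class also makes ? literal, without backslashes.
bracket <- sub("[?].*", "", urls)

# 3. fixed = TRUE switches off regex interpretation of the split pattern,
#    which is probably what the strsplit() attempt in the question needed.
split_fixed <- vapply(strsplit(urls, "?", fixed = TRUE), `[`, character(1), 1)

print(escaped)
```

All three should leave URLs without a ? untouched and cut the others at the first ?. Which one to use is mostly a matter of taste; `fixed = TRUE` avoids regex rules entirely when the delimiter is a plain string.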

Remove everything after "?" from URLs in a data frame using R
