I want to transform some links that correspond to unique user IDs:
df<- data.frame(
employeeId = c(1,2,3,4,5,6),
linkToEmployee = c("http://intranet.homepageEmploye.com/herSalary",
"http://intranet.homepageEmploye.org/herSalary/Details",
"http://local.com/qa/for",
"here the homepage is missing",
"http://local.org/",
"here the homepage is missing"))
employeeId linkToEmployee
1 1 http://intranet.homepageEmploye.com/herSalary
2 2 http://intranet.homepageEmploye.org/herSalary/Details
3 3 http://local.com/qa/for
4 4 here the homepage is missing
5 5 http://local.org/
6 6 here the homepage is missing
Now I want to convert these links into this form:
desired<- data.frame(
employeeId = c(1,2,3,4,5,6),
linkToEmployee = c("intranet.com",
"intranet.org",
"local.com",
"here",
"local.org",
"here"))
employeeId linkToEmployee
1 1 intranet.com
2 2 intranet.org
3 3 local.com
4 4 here
5 5 local.org
6 6 here
I tried using gsub for the intranet case:
df$linkToEmployee <- gsub("http://intranet.homepageEmploye.com/", "intranet.com.", df$linkToEmployee)
However, this does not work as expected.
Answer 0 (score: 1)
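One way to do this is with the urltools package, which has some very useful URL-parsing functions. First, you need to find out which entries really are URLs. To do that, I look for strings that contain a TLD.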
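library(urltools)
# TRUE where the string has a recognised suffix (TLD), i.e. it looks like a real URL
ind <- !is.na(suffix_extract(domain(df$linkToEmployee))$suffix)
# URLs: split the domain on dots and keep only the first label and the TLD
df$linkToEmployee[ind] <- sapply(strsplit(domain(df$linkToEmployee[ind]), '\\.|\\s+'),
                                 function(i) paste(head(i, 1), tail(i, 1), sep = '.'))
# Non-URLs: keep only the first word
df$linkToEmployee[!ind] <- gsub('\\s+.*', '', df$linkToEmployee[!ind])
df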
# employeeId linkToEmployee
#1 1 intranet.com
#2 2 intranet.org
#3 3 local.com
#4 4 here
#5 5 local.org
#6 6 here
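As for why the original gsub attempt falls short, here is a small sketch in base R. The pattern only matches the .com homepage, and the replacement keeps whatever follows the matched prefix, so the path is left glued to the result. The vector name `links` and the corrected pattern are illustrative assumptions, not part of the original question.

```r
# Two of the intranet links from the question (hypothetical vector name `links`)
links <- c("http://intranet.homepageEmploye.com/herSalary",
           "http://intranet.homepageEmploye.org/herSalary/Details")

# The original attempt: only the .com prefix is replaced, the path stays behind,
# and the .org link does not match the pattern at all
attempt <- gsub("http://intranet.homepageEmploye.com/", "intranet.com.", links)
attempt

# One possible fix (an assumption, not the asker's code): match the whole string,
# capture the TLD just before the first slash, and rebuild "intranet.<tld>"
fixed <- sub("^http://intranet\\.[^/]*\\.([a-z]+)/.*$", "intranet.\\1", links)
fixed
```

This only covers the intranet case; the urltools approach above handles every row in one pass.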
Note
Make sure your URL variable is not a factor. Either read the data with stringsAsFactors = FALSE or run
df$linkToEmployee <- as.character(df$linkToEmployee)
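If installing urltools is not an option, the same idea can be roughly sketched in base R. This is an alternative under stated assumptions, not the method above: it treats any string starting with http:// or https:// as a URL instead of checking for a known TLD, and the helper names `is_url`, `host` and `parts` are made up for illustration.

```r
# Rebuild the example data with character columns (not factors)
df <- data.frame(
  employeeId = c(1, 2, 3, 4, 5, 6),
  linkToEmployee = c("http://intranet.homepageEmploye.com/herSalary",
                     "http://intranet.homepageEmploye.org/herSalary/Details",
                     "http://local.com/qa/for",
                     "here the homepage is missing",
                     "http://local.org/",
                     "here the homepage is missing"),
  stringsAsFactors = FALSE
)

# Assumption: anything that starts with a scheme counts as a URL
is_url <- grepl("^https?://", df$linkToEmployee)

# Extract the host: everything between the scheme and the first slash
host <- sub("^https?://([^/]+).*$", "\\1", df$linkToEmployee[is_url])

# Keep the first and last dot-separated label of each host
parts <- strsplit(host, ".", fixed = TRUE)
df$linkToEmployee[is_url] <- vapply(
  parts,
  function(p) paste(p[1], p[length(p)], sep = "."),
  character(1)
)

# Non-URLs: keep only the first word
df$linkToEmployee[!is_url] <- sub("\\s.*$", "", df$linkToEmployee[!is_url])
df
```

The trade-off is that the scheme check is cruder than a TLD lookup: a link written without http:// would be treated as plain text here.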