How can I replace/rename rows based on patterns in a string?

Asked: 2017-04-19 13:35:00

Tags: r

I want to convert some links, each of which corresponds to a unique user ID:

    df <- data.frame(
      employeeId = c(1,2,3,4,5,6),
      linkToEmployee = c("http://intranet.homepageEmploye.com/herSalary",
                         "http://intranet.homepageEmploye.org/herSalary/Details",
                         "http://local.com/qa/for",
                         "here the homepage is missing",
                         "http://local.org/",
                         "here the homepage is missing"))


      employeeId                                        linkToEmployee
    1          1         http://intranet.homepageEmploye.com/herSalary
    2          2 http://intranet.homepageEmploye.org/herSalary/Details
    3          3                               http://local.com/qa/for
    4          4                          here the homepage is missing
    5          5                                     http://local.org/
    6          6                          here the homepage is missing

Now I want to convert these links into this form:

    desired <- data.frame(
      employeeId = c(1,2,3,4,5,6),
      linkToEmployee = c("intranet.com",
                         "intranet.org",
                         "local.com",
                         "here",
                         "local.org",
                         "here"))


      employeeId linkToEmployee
    1          1   intranet.com
    2          2   intranet.org
    3          3      local.com
    4          4           here
    5          5      local.org
    6          6           here

I tried using gsub for the intranet case:

    df$linkToEmployee <- gsub("http://intranet.homepageEmploye.com/", "intranet.com.", df$linkToEmployee)

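However, this does not work as expected: it only replaces the matched prefix, so the first row becomes "intranet.com.herSalary" rather than "intranet.com", and the other links are left untouched.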

1 answer:

Answer 0 (score: 1)

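One way to do this is with the urltools package, which has some very useful URL-parsing functions. First, you need to find out which entries are actually URLs. To do that, I looked for strings that contain a TLD.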

    library(urltools)

    # TRUE for entries whose domain has a recognised suffix (TLD), i.e. actual URLs
    ind <- !is.na(suffix_extract(domain(df$linkToEmployee))$suffix)

    # URLs: keep the first and last parts of the domain, e.g. intranet.com
    df$linkToEmployee[ind] <- sapply(strsplit(domain(df$linkToEmployee[ind]), '\\.|\\s+'),
                                     function(i) paste(head(i, 1), tail(i, 1), sep = '.'))

    # Everything else: keep only the first word
    df$linkToEmployee[!ind] <- gsub('\\s+.*', '', df$linkToEmployee[!ind])

    df
    #  employeeId linkToEmployee
    #1          1   intranet.com
    #2          2   intranet.org
    #3          3      local.com
    #4          4           here
    #5          5      local.org
    #6          6           here
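
If installing urltools is not an option, roughly the same result can be sketched in base R with regular expressions. This is a different technique from the one above: instead of checking for a known TLD, it assumes that every real URL starts with `http://` or `https://`. The helper name `shorten_link` is hypothetical and only used for illustration.

```r
# Base-R sketch; shorten_link is a hypothetical helper, not part of urltools.
shorten_link <- function(x) {
  x <- as.character(x)                      # also accepts factor input
  is_url <- grepl("^https?://", x)          # assumption: every URL carries a scheme

  # Host = the text between the scheme and the next "/", ":", "?" or "#"
  host <- sub("^https?://([^/:?#]+).*$", "\\1", x[is_url])
  labels <- strsplit(host, ".", fixed = TRUE)

  # Keep the first and last label: intranet.homepageEmploye.com -> intranet.com
  x[is_url] <- vapply(labels, function(p) {
    if (length(p) < 2) return(paste(p, collapse = ""))
    paste(p[1], p[length(p)], sep = ".")
  }, character(1))

  # Non-URLs: keep only the first word
  x[!is_url] <- sub("\\s.*$", "", x[!is_url])
  x
}

links <- c("http://intranet.homepageEmploye.com/herSalary",
           "http://intranet.homepageEmploye.org/herSalary/Details",
           "http://local.com/qa/for",
           "here the homepage is missing",
           "http://local.org/",
           "here the homepage is missing")

result <- shorten_link(links)
result
# "intranet.com" "intranet.org" "local.com" "here" "local.org" "here"
```

Unlike the urltools approach, this sketch would not recognise a bare host such as `local.org` without a scheme, and it treats only the last label as the suffix, so multi-part suffixes like `co.uk` are not handled.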

Note

Make sure your URL variable is not a factor. Either read the data with stringsAsFactors = FALSE or run

    df$linkToEmployee <- as.character(df$linkToEmployee)
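
As a small illustration of why this matters (the sample values below are only for demonstration): in R versions before 4.0.0, `data.frame()` turned strings into factors by default, and assigning a value that is not an existing level into a factor produces `NA` with a warning. A minimal sketch of both options:

```r
# Option 1: build the data frame with character columns from the start
df_chr <- data.frame(
  employeeId = c(1, 2, 3),
  linkToEmployee = c("http://local.com/qa/for",
                     "here the homepage is missing",
                     "http://local.org/"),
  stringsAsFactors = FALSE
)
is.character(df_chr$linkToEmployee)  # TRUE

# Option 2: the column is already a factor, so convert it before editing values
f <- factor(c("http://local.org/", "here the homepage is missing"))
f_chr <- as.character(f)
f_chr[2] <- "here"  # fine on a character vector; on the factor itself this
                    # would give NA and an "invalid factor level" warning
f_chr
```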