Separate variables by character within a field

Date: 2018-11-15 11:22:14

Tags: r regex tidyr

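I recently asked this question, Separate contents of field, and got a very quick, very simple answer.

One simple thing I can do in Excel is look at a cell, find the first instance of a character, and return all the characters to its left.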

For example

Authors

Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.

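In Excel I can extract Drijgers RL and Aalten P into separate columns. That lets me count how many times someone appears as first author and as last author.

How can I replicate this in R? With the separate-rows approach from the question above, I can already count the total number of times an author has published.

How do I split out the first author and the last author? That would be useful to know. In this answer, Splitting column by separator from right to left in R, the number of columns is known. How do I say "split this string at the commas and put the pieces into an unknown number of columns, to the right of the original field, depending on how many names are in the author list"?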
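The "unknown number of columns" part of the question can be sketched in base R as follows. This is only a minimal sketch under stated assumptions: the helper name split_authors and the column names author_1, author_2, ... are hypothetical and do not come from the original posts, and the input is a shortened version of the sample data used in the answers.

```r
# Hypothetical sample input (a subset of the author strings used in the answers)
authors <- c(
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
  "Drijgers RL, Verhey FR",
  "Drijgers RL"
)

# split_authors is an illustrative helper name, not part of any package
split_authors <- function(x) {
  parts <- strsplit(x, ",\\s*")   # one character vector of names per row
  n_max <- max(lengths(parts))    # the longest author list sets the column count
  # pad shorter lists with NA so every row has n_max entries
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, n_max - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("author_", seq_len(n_max))
  out
}

# new columns are appended to the right of the original field
wide <- cbind(
  data.frame(authors = authors, stringsAsFactors = FALSE),
  split_authors(authors)
)
```

The number of columns is taken from the longest author list in the data, so it does not need to be known in advance; rows with fewer authors are filled with NA on the right.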

2 Answers:

Answer 0 (score: 2)

Try this function:

extract_authors <- function(df, authors) {

  df[["FirstAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(",.*", "", df[[authors]])), df[[authors]]
  )


  df[["LastAuthor"]] <- ifelse(
    grepl(",", df[[authors]]), trimws(gsub(".*,", "", df[[authors]])), "No last author"
  )

  return(df)

}

Using a sample data set on this theme:

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

You can call it like this:

extract_authors(sample_df, "authors")

In the output you get two new columns, FirstAuthor and LastAuthor:

                                                    authors FirstAuthor     LastAuthor
1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL      Aalten P.
2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL       Kahler S
3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL   Leentjens AF
4                                    Drijgers RL, Verhey FR Drijgers RL      Verhey FR
5                                               Drijgers RL Drijgers RL No last author
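To see why the two patterns in extract_authors pick out the first and last names, here is a minimal sketch on a single string. The variable names x, first, and last are illustrative only.

```r
x <- "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P."

# ",.*" matches from the first comma to the end of the string,
# so deleting the match leaves only the first author
first <- gsub(",.*", "", x)

# ".*," is greedy: it matches everything up to and including the LAST comma,
# so deleting the match leaves the last author (plus a leading space to trim)
last <- trimws(gsub(".*,", "", x))

print(first)
print(last)
```

This is the R counterpart of the Excel idea in the question: find a delimiter and keep what lies on one side of it.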

Answer 1 (score: 1)

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

cbind.data.frame( # add the columns to the original data frame after the do.call() completes
  sample_df,
  do.call( # turn the list created with lapply below into a data frame
    rbind.data.frame,
    lapply(
      strsplit(sample_df$authors, ", "), # split at comma+space
      function(x) {
        data.frame( # pull first/last into a data frame
          first = x[1],
          last = if (length(x) < 2) NA_character_ else x[length(x)], # NA last if only one author
          stringsAsFactors = FALSE
        )
      }
    )
  )
)
##                                                     authors       first         last
## 1 Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P. Drijgers RL    Aalten P.
## 2            Drijgers RL, Verhey FR, Leentjens AF, Kahler S Drijgers RL     Kahler S
## 3                      Drijgers RL, Verhey FR, Leentjens AF Drijgers RL Leentjens AF
## 4                                    Drijgers RL, Verhey FR Drijgers RL    Verhey FR
## 5                                               Drijgers RL Drijgers RL         <NA>

The above performs terribly. I made a stringi match-group extraction version, but arg0naut's is still faster. I also optimized arg0naut's function a little, since whitespace only needs to be stripped on the left:

library(stringi)

data.frame(
  authors = c(
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
    "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
    "Drijgers RL, Verhey FR, Leentjens AF",
    "Drijgers RL, Verhey FR",
    "Drijgers RL"
  ),
  stringsAsFactors = FALSE
) -> sample_df

# make some copies since we're modifying in-place now
s1 <- s2 <- sample_df

microbenchmark::microbenchmark(

  stri_regex = {
    s1$first <-  stri_match_first_regex(s1$authors, "^([^,]+)")[,2]
    s1$last <- stri_trim_left(stri_match_last_regex(s1$authors, "([^,]+)$")[,2])
    s1$last <- ifelse(s1$last == s1$first, NA_character_, s1$last)
  },

  extract_authors = {
    s2[["first"]] <- ifelse(
      grepl(",", s2[["authors"]]), gsub(",.*", "", s2[["authors"]]), s2[["authors"]]
    )
    s2[["last"]] <- ifelse(
      grepl(",", s2[["authors"]]), trimws(gsub(".*,", "", s2[["authors"]]), "left"), NA_character_
    )

  }

)
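The strsplit() approach at the top of this answer builds one small data frame per row, which is plausibly where much of its cost comes from. As a minimal sketch, not benchmarked here, the same first/last extraction can be written with vapply() so that only plain character vectors are created. The names sample_authors, parts, and result are illustrative assumptions.

```r
# Same sample strings as in the answers, kept as a plain character vector
sample_authors <- c(
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S, Aalten P.",
  "Drijgers RL, Verhey FR, Leentjens AF, Kahler S",
  "Drijgers RL, Verhey FR, Leentjens AF",
  "Drijgers RL, Verhey FR",
  "Drijgers RL"
)

# split once at comma+space; fixed = TRUE treats the separator literally
parts <- strsplit(sample_authors, ", ", fixed = TRUE)

# first element of each split
first <- vapply(parts, function(x) x[1], character(1))

# last element of each split, or NA when there is only one author
last <- vapply(
  parts,
  function(x) if (length(x) < 2) NA_character_ else x[length(x)],
  character(1)
)

result <- data.frame(
  authors = sample_authors,
  first = first,
  last = last,
  stringsAsFactors = FALSE
)
```

Whether this is actually faster than the regex versions would need to be checked by adding it as another expression in the microbenchmark() call.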