How to grep inside mutate in dplyr

Asked: 2018-05-01 00:39:55

Tags: r grep dplyr

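I would like help understanding what is happening in my dplyr pipe, and I would welcome different ways of solving this problem.

Problem

I have a list of institutes (author affiliations as they are formally written in research journal articles), and I want to extract the main institute name. If the institute is a university, that name is the university. For simplicity, that is the only case I deal with in this example.

Attempted solution logic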

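  1. Split the institute name on commas
  2. grep for the term "univ", or for a list of other university-related terms
  3. Extract the index of the piece that has the hit
  4. Edge cases / assumptions

    • The term I am searching for appears in only one of the split pieces
    • All institutes here are universities (kept simple for this Stack Overflow question)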
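
The steps above, applied to a single affiliation string, can be sketched in base R as follows. This is only an illustration for one string, not the asker's pipeline; the names `one_institute`, `parts`, `hits`, and `guess` are made up for this sketch.

```r
# One affiliation string, taken from the example data in this question
one_institute <- "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany"

# Step 1: split on commas; strsplit() returns a list, so take its first element
parts <- strsplit(one_institute, ",")[[1]]

# Step 2: grep() returns the positions of the pieces that contain "univ"
hits <- grep("univ", parts)

# Step 3: keep the first matching piece (NA if nothing matched)
guess <- parts[hits][1]

# trimws() drops the leading space left over from the ", " separator
trimws(guess)
```

Note that this works because `parts` holds the pieces of exactly one string; the question is about what changes when the same expression sees a whole column.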

Code

    df %>%
      mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
      head()
    

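What I expected to happen, but which does not, is the logic I wrote above. What I see instead is that inside mutate the first instance of institute is used for every row of df, and the exact same " university of new so~" is filled in everywhere. I have a general idea of what the error is, but I do not know why it happens or how to fix it while staying in dplyr. I could do this with the apply functions; I am curious what kinds of answers are out there.

The output looks like: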

    # A tibble: 6 x 2
      institute                                                                          instGuess              
      <chr>                                                                              <chr>                  
    1 school of computer science and engineering, university of new south wales, sydney~ " university of new so~
    2 department computer science, friedrich-alexander-university, erlangen-nuremberg, ~ " university of new so~
    3 department of ece, pesit, bangalore, india                                         " university of new so~
    4 school of information technology and electrical engineering, university of queens~ " university of new so~
    5 school of information technology and electrical engineering, university of queens~ " university of new so~
    6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ri~ " university of new so~
    

Data used for the example

    df <- structure(list(institute = c("school of computer science and engineering, university of new south wales, sydney, australia", 
    "department computer science, friedrich-alexander-university, erlangen-nuremberg, germany", 
    "department of ece, pesit, bangalore, india", "school of information technology and electrical engineering, university of queenslandqld, australia", 
    "school of information technology and electrical engineering, university of queenslandold, australia", 
    "dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
    ), instGuess = c(" university of new south wales", " university of new south wales", 
    " university of new south wales", " university of new south wales", 
    " university of new south wales", " university of new south wales"
    )), .Names = c("institute", "instGuess"), row.names = c(NA, -6L
    ), class = c("tbl_df", "tbl", "data.frame"))
    

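4 answers:

Answer 0 (score: 4)

You need to add group_by for this to work, so that mutate evaluates the expression separately for each institute: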

df %>%
  group_by(institute) %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1])

Result:

# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                  instGuess              
  <chr>                                                                      <chr>                  
1 school of computer science and engineering, university of new south wales… " university of new so…
2 department computer science, friedrich-alexander-university, erlangen-nur… " friedrich-alexander-…
3 department of ece, pesit, bangalore, india                                 NA                     
4 school of information technology and electrical engineering, university o… " university of queens…
5 school of information technology and electrical engineering, university o… " university of queens…
6 dept. of info. syst. and comp. sci., national university of singapore, 10… " national university …

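Answer 1 (score: 3)

I think @Pdubbs's answer is the best starting point. It uses group_by to mimic @www's answer, which uses rowwise(), but the difference (clearly an advantage, in my view) is that when $institute contains duplicates, the guess is made only once per institute, which gains efficiency.

This answer goes one step further and avoids calling strsplit twice for each instance. To demonstrate, I will duplicate the first row: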

df <- df[c(1,1:6),]

Define a function that does the work without repeating the strsplit call:

find_univ <- function(x) {
  message('*', appendLF=FALSE)
  y <- strsplit(x[[1]], ',')[[1]]
  y[grep('univ', y)][1]
}

(with a message call inserted to show how many times the function is called; do not include it in production), and then the pipeline:

df %>%
  group_by(institute) %>%
  mutate(instGuess = find_univ(institute)) %>%
  ungroup() %>%
  select(instGuess) # for display purposes only
# ******  <---- six calls on seven rows, benefit of group_by
# A tibble: 7 × 1
#                           instGuess
#                               <chr>
# 1     university of new south wales
# 2     university of new south wales
# 3    friedrich-alexander-university
# 4                              <NA>
# 5       university of queenslandqld
# 6       university of queenslandold
# 7  national university of singapore

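I do not know whether de-duplicating the strsplit call has a measurable impact; it would only matter if you have a lot of data. Otherwise it is just a perfectionist-level efficiency that stops short of "premature optimization".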
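
Along the same lines, a minimal base R sketch that splits each string exactly once and needs no grouping at all could look like this. The helper `first_match` and its `pattern` argument are hypothetical names introduced for illustration; they do not come from any of the answers, and `trimws()` is added here to drop the leading space.

```r
# Hypothetical helper: first comma-separated piece that contains `pattern`
first_match <- function(parts, pattern = "univ") {
  trimws(parts[grep(pattern, parts)][1])   # NA when nothing matches
}

# Three affiliation strings from the example data
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india",
  "dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, singapore 119260, singapore"
)

# strsplit() runs once over the whole vector and returns one element per string;
# vapply() then applies the helper to each string's pieces separately
instGuess <- vapply(strsplit(institute, ","), first_match, character(1))
instGuess
```

Inside the pipe, the same expression should work as `mutate(instGuess = vapply(strsplit(institute, ","), first_match, character(1)))`, because it already returns one value per row.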

Answer 2 (score: 2)

You can use sub:

a <- df %>%
  group_by(institute) %>%
  mutate(Instname = sub("(.*,\\s|)(.*unive.*?)(,|$).*|.*", "\\2", institute))
> a
# A tibble: 6 x 2
# Groups:   institute [6]
  institute                                                                                           Instname                   
  <chr>                                                                                               <chr>                      
1 school of computer science and engineering, university of new south wales, sydney, australia        university of new south wa~
2 department computer science, friedrich-alexander-university, erlangen-nuremberg, germany            friedrich-alexander-univer~
3 department of ece, pesit, bangalore, india                                                          ""                         
4 school of information technology and electrical engineering, university of queenslandqld, australia university of queenslandqld
5 school of information technology and electrical engineering, university of queenslandold, australia university of queenslandold
6 dept. of info. syst. and comp. sci., national university of singapore, 10 kent ridge crescent, sin~ national university of sin~
> a$Instname
[1] "university of new south wales"    "friedrich-alexander-university"   ""                                
[4] "university of queenslandqld"      "university of queenslandold"      "national university of singapore"

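Answer 3 (score: 1)

It looks like only the first match is being used: without grouping, mutate evaluates the expression once over the whole institute column, so the trailing [1] keeps a single value that is then recycled to every row. We can use rowwise to treat each row as its own group, which makes the operation row-specific.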

library(dplyr)

df %>%
  rowwise() %>%
  mutate(instGuess = unlist(strsplit(institute, ","))[grep("univ", unlist(strsplit(institute, ",")))][1]) %>%
  ungroup() %>%
  head()
# # A tibble: 6 x 2
# institute                                                              instGuess             
#   <chr>                                                                  <chr>                 
# 1 school of computer science and engineering, university of new south w~ " university of new s~
# 2 department computer science, friedrich-alexander-university, erlangen~ " friedrich-alexander~
# 3 department of ece, pesit, bangalore, india                             NA                    
# 4 school of information technology and electrical engineering, universi~ " university of queen~
# 5 school of information technology and electrical engineering, universi~ " university of queen~
# 6 dept. of info. syst. and comp. sci., national university of singapore~ " national university~
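
To see why the ungrouped version repeats one value, the same expression can be traced outside of dplyr. The following base R sketch uses two of the example strings; the variable names are illustrative, and `rep()` only mimics the recycling that mutate performs on a length-one result.

```r
# Two affiliation strings from the example data
institute <- c(
  "school of computer science and engineering, university of new south wales, sydney, australia",
  "department of ece, pesit, bangalore, india"
)

# strsplit() returns one character vector per input string ...
pieces_per_row <- strsplit(institute, ",")
length(pieces_per_row)   # 2, one entry per row

# ... but unlist() flattens them into a single vector, losing the row boundaries
all_pieces <- unlist(pieces_per_row)
length(all_pieces)       # 8 pieces in total (4 + 4)

# grep() followed by [1] therefore yields ONE value for the whole column
single_guess <- all_pieces[grep("univ", all_pieces)][1]
single_guess             # " university of new south wales"

# A length-one result is recycled to every row, even the row with no university
rep(single_guess, length(institute))
```

With rowwise() or group_by(institute), the expression only ever sees one string at a time, so the flattening step no longer mixes rows.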