Removing partial strings based on a regular expression in R

Time: 2019-01-28 07:05:14

Tags: r regex stringr

Say I have a vector of strings, as follows:

vector<-c("hi, how are you doing?", 
           "what time is it?", 
           "the sky is blue", 
           "hi, how are you doing today? You seem tired.", 
           "walk the dog", 
           "the grass is green", 
           "the sky is blue during the day")

vector
[1] "hi, how are you doing?"                      
[2] "what time is it?"                            
[3] "the sky is blue"                             
[4] "hi, how are you doing today? You seem tired."
[5] "walk the dog"                                
[6] "the grass is green"                          
[7] "the sky is blue during the day" 

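How can I identify all entries whose first four words match, and then keep only the longest of the matching strings? I am looking for a result like the following vector: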

vector                    
[1] "what time is it?"                                                        
[2] "hi, how are you doing today? You seem tired."
[3] "walk the dog"                                
[4] "the grass is green"                          
[5] "the sky is blue during the day"                          

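Ideally, I would like a stringr solution so that it can be used in a pipe.

Update: robustness check with different values:

@Wimpel's solution works very well but, as @Wimpel pointed out, it does not work in all cases. See, for example: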

vector<-c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016 ")

df <- data.frame( text = vector, stringsAsFactors = FALSE ) 
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 
df %>%
    mutate( length = str_count( text, " ") + 1,
            row_id = row_number() ) %>%
    group_by( group_id ) %>%
    arrange( -length ) %>%
    slice(1) %>%
    ungroup() %>%
    arrange( row_id ) %>%
    select( text )

1 what time is it?                            
2 hi, how are you doing today? You seem tired.
3 walk the dog                                
4 the grass is green                          
5 the sky is blue during the day  

In the example above, the dates are dropped even though they do not match one another.
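A likely explanation, sketched here as an illustration rather than taken from the original post: `stringr::word()` returns `NA` when a string has fewer words than the requested `end`, so every entry shorter than four words appears to land in the same `NA` group, and `slice(1)` then keeps only one of them. The vector `x` below is made up for this sketch.

```r
library(stringr)

# Illustrative strings: two with at least four words, three with fewer.
x <- c("the sky is blue",
       "the sky is blue during the day",
       "walk the dog",
       "12/7/2018",
       "8/12/2018")

# First four space-separated words; NA when a string has fewer than four.
key <- word(x, start = 1, end = 4)
print(key)

# Word count computed the same way as in the code above.
n_words <- str_count(x, " ") + 1
print(n_words)
```

The three short strings share the same missing key, which would explain why only "walk the dog" survives in the output above.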

1 Answer:

Answer 0 (score: 5)

Using the updated sample data

vec <- c("hi, how are you doing?", 
          "what time is it?", 
          "the sky is blue", 
          "hi, how are you doing today? You seem tired.", 
          "walk the dog", 
          "the grass is green", 
          "the sky is blue during the day", 
          "12/7/2018", 
          "8/12/2018", 
          "9/9/2016")

Code

library( tidyverse )

df <- data.frame( text = vec, stringsAsFactors = FALSE ) 
#create group indices from the first four words of each string
df$group_id <- df %>% group_indices( stringr::word( text, start = 1, end = 4) ) 

df %>%
  #create some helping variables
  mutate( length = str_count( text, " ") + 1,
          row_id = row_number() ) %>%
  #now group on id
  group_by( group_id ) %>%
  #arrange by group on length (descending)
  arrange( -length ) %>%
  #keep only the first row (of every group ), also keep all strings shorter than 4 words
  filter( (row_number() == 1L & length >= 4) | length < 4 ) %>%
  ungroup() %>%
  #set back to the original order
  arrange( row_id ) %>%
  select( text )

Output

# # A tibble: 8 x 1
# text                                        
#   <chr>                                       
# 1 what time is it?                            
# 2 hi, how are you doing today? You seem tired.
# 3 walk the dog                                
# 4 the grass is green                          
# 5 the sky is blue during the day              
# 6 12/7/2018                                   
# 7 8/12/2018  
# 8 9/9/2016
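
For comparison, the same idea can be sketched without a data frame, using only stringr and base R. This is not part of the answer above: `keep_longest_by_prefix()` is a hypothetical helper name, and the sketch assumes words are separated by single spaces, as in the sample data.

```r
library(stringr)

vec <- c("hi, how are you doing?",
         "what time is it?",
         "the sky is blue",
         "hi, how are you doing today? You seem tired.",
         "walk the dog",
         "the grass is green",
         "the sky is blue during the day",
         "12/7/2018",
         "8/12/2018",
         "9/9/2016")

# Hypothetical helper: among strings sharing the same first `n` words, keep
# only the one with the most words. Strings with fewer than `n` words are
# always kept. The original order of the retained strings is preserved.
keep_longest_by_prefix <- function(x, n = 4) {
  key     <- word(x, start = 1, end = n)  # NA when fewer than n words
  n_words <- str_count(x, " ") + 1        # assumes single-space separators
  short   <- is.na(key)

  # Positions of the strings that are long enough to have a key.
  long_idx <- which(!short)

  # For each key, the position of the string with the most words
  # (the first one wins on ties).
  best <- tapply(long_idx, key[long_idx],
                 function(i) i[which.max(n_words[i])])

  x[sort(c(which(short), unname(best)))]
}

res <- keep_longest_by_prefix(vec)
print(res)
```

Because the helper takes the character vector as its first argument, it can be placed in a pipe in the same way as other stringr functions.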