Remove strings from rows with specific words

Time: 2019-12-17 16:19:42

Tags: r

My data looks like this:

Weather                           
   <chr>                             
 1 Snow Low clouds                   
 2 Snow Cloudy                       
 3 Drizzle Fog                       
 4 Thundershowers Partly cloudy      
 5 Thunderstorms More clouds than sun
 6 Sprinkles Partly cloudy           
 7 Heavy rain Broken clouds          
 8 Light rain Partly cloudy     

I am trying to remove some of the text using mutate. For example, I would like the data above to look like this:

Weather                           
   <chr>                             
 1 Snow                   
 2 Snow                       
 3 Drizzle                      
 4 Thundershowers      
 5 Thunderstorms More clouds than sun
 6 Sprinkles Partly cloudy           
 7 Heavy rain           
 8 Light rain 

So I want to remove the text that follows certain words. Say I have the following vector:

c("Snow", "Drizzle", "Heavy rain", "Light rain") 

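I want to remove the text after each of these. However, I do not want to grep for words such as Cloudy or Fog, because they also appear in rows of their own in the data, whereas an entry like Snow Light fog can be reduced to Snow.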
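For a reproducible example, a data frame consistent with the 20 Weather values printed in the answers below can be built as follows. This is a sketch under assumptions: the printed output suggests the asker had a tibble, while a plain data.frame named d is used here.

```r
# Assumed reconstruction of the question's data, copied from the rows
# that appear in the answers' printed output.
d <- data.frame(
  Weather = c(
    "Snow Low clouds", "Snow Cloudy", "Drizzle Fog",
    "Thundershowers Partly cloudy", "Thunderstorms More clouds than sun",
    "Sprinkles Partly cloudy", "Heavy rain Broken clouds",
    "Light rain Partly cloudy", "Rain showers Passing clouds",
    "Thundershowers Scattered clouds", "Thundershowers Passing clouds",
    "Light snow Overcast", "Snow Light fog", "Drizzle Broken clouds",
    "Light rain Fog", "Cloudy", "Thunderstorms Partly cloudy",
    "Heavy rain More clouds than sun", "Partly cloudy", NA
  ),
  stringsAsFactors = FALSE
)
```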

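2 Answers:

Answer 0: (score: 3)

One general approach here is to build a regex alternation of all the target words. Then match one of those terms together with everything that follows it up to the end of the input, and replace the whole match with just the term.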

terms <- c("Snow", "Drizzle", "Heavy rain", "Light rain")
regex <- paste0("\\b(", paste(terms, collapse="|"), ")\\b")
sub(paste0(regex, "\\s.*"), "\\1", d$Weather)

 [1] "Snow"                               "Snow"                              
 [3] "Drizzle"                            "Thundershowers Partly cloudy"      
 [5] "Thunderstorms More clouds than sun" "Sprinkles Partly cloudy"           
 [7] "Heavy rain"                         "Light rain"                        
 [9] "Rain showers Passing clouds"        "Thundershowers Scattered clouds"   
[11] "Thundershowers Passing clouds"      "Light snow Overcast"               
[13] "Snow"                               "Drizzle"                           
[15] "Light rain"                         "Cloudy"                            
[17] "Thunderstorms Partly cloudy"        "Heavy rain"                        
[19] "Partly cloudy"                      NA

Note that my output does not exactly match your expected output, but then your suggested vector does not include all of the target words either.
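Since the question asks for mutate, the same sub() call can be wrapped in a small helper and applied to the column. This is only a minimal sketch: strip_after and the four-row sample d are hypothetical, introduced for illustration, and the dplyr step is guarded so that the base R line still runs when dplyr is not installed.

```r
terms <- c("Snow", "Drizzle", "Heavy rain", "Light rain")

# Hypothetical helper: keep only the target term when a row contains one
# followed by more text; every other row (including NA) is left unchanged.
strip_after <- function(x, terms) {
  regex <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b\\s.*")
  sub(regex, "\\1", x)
}

# Small illustrative sample, not the asker's full data
d <- data.frame(
  Weather = c("Snow Low clouds", "Sprinkles Partly cloudy", "Light rain Fog", NA),
  stringsAsFactors = FALSE
)

# Base R equivalent of the mutate step
base_result <- strip_after(d$Weather, terms)

# The same step with dplyr, if it is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  d <- dplyr::mutate(d, Weather = strip_after(Weather, terms))
} else {
  d$Weather <- base_result
}
```

Keeping the regex construction inside one function means the term list can change without touching the mutate call.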

The regex I used is:

\b(Snow|Drizzle|Heavy rain|Light rain)\b

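The trick here is that the alternation above is also a capture group, which lets us replace the whole match with just the term that was found. You can add more terms to the vector to get the output you want.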
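As a sketch of that last point, the snippet below extends the term list and shows what the capture group gives back. The extra term Thundershowers is taken from the expected output in the question; the sample vector x is made up for illustration.

```r
# Terms from the question plus one more that the expected output implies
terms <- c("Snow", "Drizzle", "Heavy rain", "Light rain", "Thundershowers")
regex <- paste0("\\b(", paste(terms, collapse = "|"), ")\\b\\s.*")

# Illustrative sample values
x <- c("Thundershowers Partly cloudy", "Snow Light fog",
       "Light snow Overcast", "Cloudy")

# \\1 is whatever the group (Snow|Drizzle|...) matched, so the match
# (term plus trailing text) is replaced by the term alone.
res <- sub(regex, "\\1", x)
print(res)
```

Entries such as Light snow Overcast are left alone because the match is case-sensitive and Light snow is not in the list, and Cloudy is untouched because it is not a target term.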

Answer 1: (score: 1)

  • Perhaps you could use the code below:
v <- c("Snow", "Drizzle", "Heavy rain", "Light rain") 
pat <- paste0(v,collapse = "|")
unlist(regmatches(d$Weather,gregexpr(pat,d$Weather)))

which gives

> unlist(regmatches(d$Weather,gregexpr(pat,d$Weather)))
[1] "Snow"       "Snow"       "Drizzle"    "Heavy rain" "Light rain" "Snow"      
[7] "Drizzle"    "Light rain" "Heavy rain"
  • If you want to add the extracted values to d as a new column, take the first match in each row (NA when there is none) so that the values stay aligned with their rows:
# unlist() would drop the rows without a match, so extract per row instead
d <- within(d,X <- vapply(regmatches(Weather,gregexpr(pat,Weather)),function(m) if (length(m)) m[1] else NA_character_,character(1)))

which gives

> d
# A tibble: 20 x 2
   Weather                            X         
   <chr>                              <chr>     
 1 Snow Low clouds                    Snow      
 2 Snow Cloudy                        Snow      
 3 Drizzle Fog                        Drizzle   
 4 Thundershowers Partly cloudy       NA        
 5 Thunderstorms More clouds than sun NA        
 6 Sprinkles Partly cloudy            NA        
 7 Heavy rain Broken clouds           Heavy rain
 8 Light rain Partly cloudy           Light rain
 9 Rain showers Passing clouds        NA        
10 Thundershowers Scattered clouds    NA        
11 Thundershowers Passing clouds      NA        
12 Light snow Overcast                NA        
13 Snow Light fog                     Snow      
14 Drizzle Broken clouds              Drizzle   
15 Light rain Fog                     Light rain
16 Cloudy                             NA        
17 Thunderstorms Partly cloudy        NA        
18 Heavy rain More clouds than sun    Heavy rain
19 Partly cloudy                      NA        
20 NA                                 NA
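One detail worth checking when turning extracted values into a column: unlist() drops the rows that have no match, so its result is shorter than the column and cannot be paired with rows by position. The small sketch below shows the difference; the vector weather is illustrative sample data, and keeping the first match per row is an assumption about what is wanted.

```r
# Illustrative sample: two rows with a target term, one without, one NA
weather <- c("Snow Low clouds", "Thundershowers Partly cloudy",
             "Heavy rain Broken clouds", NA)
pat <- paste(c("Snow", "Drizzle", "Heavy rain", "Light rain"), collapse = "|")

# One list element per input row; rows without a match give character(0)
hits <- regmatches(weather, gregexpr(pat, weather))
n_hits <- lengths(hits)

# Flattening loses the empty rows, so positions no longer line up
flat <- unlist(hits)

# Taking the first match per row keeps one value (or NA) for every row
aligned <- vapply(hits, function(m) if (length(m)) m[1] else NA_character_,
                  character(1))

print(n_hits)
print(flat)
print(aligned)
```

Here flat has two elements while weather has four, whereas aligned has the same length as the input and can be assigned to a column directly.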