Keep only certain strings

Time: 2018-02-21 14:06:52

Tags: r regex

I have a data frame like this:

df_in <- data.frame(x = c('x1','x2','x3','x4'),
                     col1 = c('http://youtube.com/something','NA','https://www.yahooexample.com','https://www.yahooexample2.com'),
                     col2 = c('https://google.com', 'http://www.bbcnews2.com?id=321','NA','https://google.com/text'),
                     col3 = c('http://www.bbcnews.com?id=321', 'http://google.com?id=1234','NA','https://bbcnews.com/search'),
                     col4 = c('NA', 'https://www.youtube/com','NA', 'www.youtube.com/searcht'))

In col1, col2 and col3, how can I keep only the cells that contain "google", "youtube" or "bbc", and set cells containing any other words to NA?

Example of the expected output:

   x                         col1                           col2                          col3                    col4
1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                      NA
2 x2                           NA http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
3 x3                           NA                             NA                            NA                      NA
4 x4                           NA        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

4 answers:

Answer 0: (score: 2)

We can use mutate_at to change the columns 'col1' through 'col4': str_detect checks whether each value contains 'google', 'youtube' or 'bbc', and every other element is replaced with NA.

library(dplyr)
library(stringr)
df_in %>%
     # as.character() guards against the columns being factors
     mutate_at(vars(col1:col4), funs(ifelse(str_detect(., 
                "google|youtube|bbc"), as.character(.), NA)))

Output:

#    x                         col1                           col2                          col3                    col4
#  1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
#  2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
#  3 x3                         <NA>                           <NA>                          <NA>                    <NA>
#  4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht
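
For readers on a recent dplyr: funs() was deprecated in dplyr 0.8.0 and mutate_at() was superseded by across() in dplyr 1.0.0. The block below is a minimal sketch of the same idea with across(), assuming dplyr >= 1.0.0 and a trimmed copy of the question's data (only x, col1 and col2) with character columns. The name df_out is chosen here for illustration.

```r
library(dplyr)
library(stringr)

# Trimmed copy of the question's data (assumption: character columns, not factors)
df_in <- data.frame(
  x    = c("x1", "x2", "x3", "x4"),
  col1 = c("http://youtube.com/something", "NA",
           "https://www.yahooexample.com", "https://www.yahooexample2.com"),
  col2 = c("https://google.com", "http://www.bbcnews2.com?id=321",
           "NA", "https://google.com/text"),
  stringsAsFactors = FALSE
)

# Keep a value only if it contains one of the keywords; otherwise use NA
df_out <- df_in %>%
  mutate(across(col1:col2,
                ~ ifelse(str_detect(.x, "google|youtube|bbc"), .x, NA_character_)))

df_out
```

Using NA_character_ rather than a bare NA keeps the result a character vector even if every value in a column is replaced.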

Answer 1: (score: 2)

You can use lapply with replace:

cols <- c("col1","col2","col3","col4")
df_in[,cols] <- lapply(df_in[,cols], 
                       function(x) replace(x, !grepl("google|youtube|bbc",x ), NA))

df_in
#   x                         col1                           col2                          col3                    col4
#1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
#2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
#3 x3                         <NA>                           <NA>                          <NA>                    <NA>
#4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

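This searches anywhere in the string. If you only want to make sure that the domain name is one of "google|youtube|bbc", you can change the grepl call to the following (the dot is escaped so that it matches only a literal "."):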

grepl("(google|youtube|bbc)\\.com", x)
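
To make the behaviour of that domain pattern concrete, here is a small base-R sketch on three URLs taken from the question. It compares an unescaped dot, an escaped dot, and a third, looser pattern that is an assumption of this sketch rather than part of the answer: it lets letters, digits or hyphens follow the keyword so that "bbcnews.com" still counts. The variable names are illustrative.

```r
urls <- c("http://youtube.com/something",
          "https://www.youtube/com",
          "http://www.bbcnews.com?id=321")

# Unescaped dot: "." matches any character, including the "/" in "youtube/com"
loose <- grepl("(google|youtube|bbc).com", urls)
loose
# [1]  TRUE  TRUE FALSE

# Escaped dot: only a literal "." is accepted between the keyword and "com"
strict <- grepl("(google|youtube|bbc)\\.com", urls)
strict
# [1]  TRUE FALSE FALSE

# Assumed variant: allow extra letters, digits or hyphens after the keyword
extended <- grepl("(google|youtube|bbc)[[:alnum:]-]*\\.com", urls)
extended
# [1]  TRUE FALSE  TRUE
```

Note that both of the first two patterns require the keyword to sit directly before ".com", so "bbcnews.com" is rejected by them.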

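Answer 2: (score: 2)

URLs can be awkward to parse. I suggest using the urltools library to parse them, and then grepl to find the domains of interest (assuming you care about those words appearing in the domain), i.e.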

library(urltools)

#Extract the domain
domain(df_in$col1)
#[1] "youtube.com"  "na"    "www.yahooexample.com"  "www.yahooexample2.com"

To apply this to your problem, restrict the call to the URL columns so that x is left untouched:

cols <- c("col1", "col2", "col3", "col4")
# as.character() guards against the columns being factors
df_in[cols] <- lapply(df_in[cols], function(i) replace(i, !grepl('google|youtube|bbc', domain(as.character(i))), NA))

   x                         col1                           col2                          col3                    col4
1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
3 x3                         <NA>                           <NA>                          <NA>                    <NA>
4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

Answer 3: (score: 1)

Overview

Use a combination of sapply() and grepl() to change the elements that do not match your desired pattern to NA.

df_in[ , 2:5 ][] <- sapply( X = list( df_in$col1, df_in$col2, df_in$col3, df_in$col4 )
                          , FUN = function( i ){
                            # work on character values in case the column is a factor
                            i <- as.character( i )
                            # flag elements that meet the condition
                            condition <- grepl( pattern = "youtube|bbc|google"
                                                , x = i
                             )
                             # replace elements that don't meet the condition with NA
                             i[ !condition ] <- NA
                             # return the modified column to sapply()
                             return( i )
                          }
                        )
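
A caveat worth knowing for this style of replacement: grep() returns the positions of the matches, and negating an empty set of positions selects nothing rather than everything. A column with no matching element would therefore be left unchanged instead of being set to NA. The sketch below illustrates this on a made-up two-element vector; the names hits, by_index and by_mask are illustrative.

```r
# A column in which nothing matches the pattern
i <- c("https://www.yahooexample.com", "NA")

hits <- grep("youtube|bbc|google", i)   # integer(0): no matches

# Negative indexing: -integer(0) selects no elements, so nothing is replaced
by_index <- i
by_index[-hits] <- NA
by_index
# [1] "https://www.yahooexample.com" "NA"

# Logical mask from grepl(): every non-matching element is replaced
by_mask <- i
by_mask[!grepl("youtube|bbc|google", i)] <- NA
by_mask
# [1] NA NA
```

A logical mask from grepl() behaves the same way whether zero, some, or all elements match, which is why it is the safer choice here.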