Question

I have a data frame like this:

df_in <- data.frame(x = c('x1','x2','x3','x4'),
                     col1 = c('http://youtube.com/something','NA','https://www.yahooexample.com','https://www.yahooexample2.com'),
                     col2 = c('https://google.com', 'http://www.bbcnews2.com?id=321','NA','https://google.com/text'),
                     col3 = c('http://www.bbcnews.com?id=321', 'http://google.com?id=1234','NA','https://bbcnews.com/search'),
                     col4 = c('NA', 'https://www.youtube/com','NA', 'www.youtube.com/searcht'))

How can I keep, in col1, col2 and col3, only the cells that contain "google", "youtube" or "bbc", and set all other cells to NA?

Example of the expected output:

   x                          col1                           col2                          col3                    col4
1 x1  http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                      NA
2 x2                            NA http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
3 x3                            NA                             NA                            NA                      NA
4 x4                            NA        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

Answer 1

We can use mutate_at to modify columns 'col1' to 'col4': str_detect checks whether each element contains 'google', 'youtube' or 'bbc', and every other element is replaced with NA.

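library(dplyr)
library(stringr)
# Keep a cell only if it matches one of the words; otherwise set it to NA.
# A formula (~) is used because funs() is deprecated in recent dplyr versions.
df_in %>%
     mutate_at(vars(col1:col4), ~ ifelse(str_detect(., 
                "google|youtube|bbc"), as.character(.), NA))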

-output

#    x                         col1                           col2                          col3                    col4
#  1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
#  2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
#  3 x3                         <NA>                           <NA>                          <NA>                    <NA>
#  4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

Answer 2

You can use lapply together with replace:

cols <- c("col1","col2","col3","col4")
df_in[,cols] <- lapply(df_in[,cols], 
                       function(x) replace(x, !grepl("google|youtube|bbc",x ), NA))

df_in
#   x                         col1                           col2                          col3                    col4
#1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
#2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
#3 x3                         <NA>                           <NA>                          <NA>                    <NA>
#4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht

This matches the words anywhere in the string. If you only want a match when the domain is one of "google|youtube|bbc", you can change the grepl call to:

# "\\." matches a literal dot; an unescaped "." would match any character
grepl("(google|youtube|bbc)\\.com", test_string)
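
To make the difference between the two patterns concrete, here is a small sketch on a few made-up URLs (the vector name urls and its values are only for illustration). The dot is escaped as \\. so that it matches a literal dot rather than any character.

```r
urls <- c("https://google.com/text",
          "https://www.yahooexample.com/?q=google",
          "https://www.youtube/com",
          "http://www.bbcnews.com?id=321")

# Matches the words anywhere in the string, including the query part
anywhere <- grepl("google|youtube|bbc", urls)

# Requires the word to be followed directly by a literal ".com"
dot_com <- grepl("(google|youtube|bbc)\\.com", urls)

data.frame(urls, anywhere, dot_com)
```

Note that the stricter pattern also rejects bbcnews.com, because "bbc" is not immediately followed by ".com"; whether that is desirable depends on which domains you want to keep.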

Answer 3

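URLs can be awkward to parse. I suggest using the urltools library to parse them and then grepl to find the domains of interest (assuming you care about those words appearing in the domain), i.e.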

library(urltools)

#Extract the domain
domain(df_in$col1)
#[1] "youtube.com"  "na"    "www.yahooexample.com"  "www.yahooexample2.com"

To apply this to your problem:

cols <- c("col1", "col2", "col3", "col4")
# Restrict to the URL columns; applying this to every column would also turn x into NA
df_in[cols] <- lapply(df_in[cols], function(i) replace(i, !grepl('google|youtube|bbc', domain(i)), NA))

df_in
#   x                         col1                           col2                          col3                    col4
#1 x1 http://youtube.com/something             https://google.com http://www.bbcnews.com?id=321                    <NA>
#2 x2                         <NA> http://www.bbcnews2.com?id=321     http://google.com?id=1234 https://www.youtube/com
#3 x3                         <NA>                           <NA>                          <NA>                    <NA>
#4 x4                         <NA>        https://google.com/text    https://bbcnews.com/search www.youtube.com/searcht
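
If adding a package is not an option, the host part can be roughly approximated in base R with regular expressions. This is only a sketch: the helper name get_host is hypothetical, the sample URLs are made up, and the pattern ignores ports, user info and other cases that a real URL parser such as urltools handles.

```r
# Hypothetical helper: strip an optional scheme, then keep everything
# up to the first "/", "?" or "#"
get_host <- function(u) {
  u <- as.character(u)
  u <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", u)
  sub("[/?#].*$", "", u)
}

urls <- c("http://youtube.com/something",
          "https://www.yahooexample.com/?q=google",
          "www.youtube.com/searcht",
          "http://www.bbcnews.com?id=321")

hosts <- get_host(urls)
keep  <- grepl("google|youtube|bbc", hosts)

# URLs whose host does not contain one of the words become NA
replace(urls, !keep, NA)
```

Matching against the host rather than the full string means that a word appearing only in the path or query, as in the second URL, no longer counts as a match.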

Answer 4

Overview

Use a combination of sapply() and grep() to change the elements that don't fit your desired pattern to NA.

df_in[ , 2:5 ][] <- sapply( X = list( df_in$col1, df_in$col2, df_in$col3, df_in$col4 )
                          , FUN = function( i ){
                            # positions of the elements that meet the condition
                            condition <- grep( pattern = "youtube|bbc|google"
                                                , x = i
                             )
                             # replace elements that don't meet the condition with NA;
                             # setdiff() is used because i[ -condition ] selects nothing
                             # when no element matches
                             i[ setdiff( seq_along( i ), condition ) ] <- NA
                             # return the modified vector to sapply()
                             return( i )
                          }
                        )
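
A note on indexing with the result of grep(): when no element matches, grep() returns integer(0), and a negative index built from it selects nothing, so a column without any match would be left unchanged instead of being set to NA. The sketch below shows the difference on a made-up vector (x and its values are only for illustration).

```r
# Made-up values, none of which contains the words
x <- c("https://www.yahooexample.com", "https://example.org")

hits <- grep("youtube|bbc|google", x)   # integer(0): nothing matched

# Negative indexing with an empty vector selects nothing, so a stays unchanged
a <- x
a[-hits] <- NA

# A logical mask from grepl() handles the no-match case as intended
b <- x
b[!grepl("youtube|bbc|google", x)] <- NA
```

Here a still holds the original strings, while b is entirely NA.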
