I have a data frame like this:
df_in <- data.frame(x = c('x1','x2','x3','x4'),
col1 = c('http://youtube.com/something','NA','https://www.yahooexample.com','https://www.yahooexample2.com'),
col2 = c('https://google.com', 'http://www.bbcnews2.com?id=321','NA','https://google.com/text'),
col3 = c('http://www.bbcnews.com?id=321', 'http://google.com?id=1234','NA','https://bbcnews.com/search'),
col4 = c('NA', 'https://www.youtube/com','NA', 'www.youtube.com/searcht'))
In col1, col2 and col3, how can I keep only the cells that contain "google", "youtube" or "bbc", and set every other cell to NA?
Example of the expected output:
x col1 col2 col3 col4
1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321 NA
2 x2 NA http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
3 x3 NA NA NA NA
4 x4 NA https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
Answer 0 (score: 2)
We can use mutate_at to modify the columns 'col1' to 'col4'. str_detect checks whether each value contains 'google', 'youtube' or 'bbc', and any value that does not is replaced with NA.
library(dplyr)
library(stringr)
df_in %>%
  # a formula (~) replaces the deprecated funs() wrapper; "." is the current column
  mutate_at(vars(col1:col4),
            ~ ifelse(str_detect(., "google|youtube|bbc"), as.character(.), NA))
-output
# x col1 col2 col3 col4
# 1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321 <NA>
# 2 x2 <NA> http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
# 3 x3 <NA> <NA> <NA> <NA>
# 4 x4 <NA> https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
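In current dplyr releases, mutate_at() is superseded by across(). A minimal sketch of the same idea with across() follows. It assumes dplyr 1.0.0 or later, and `df_small` is a cut-down stand-in for `df_in` that is not part of the question:

```r
library(dplyr)
library(stringr)

# Cut-down stand-in for df_in (hypothetical name, two URL columns only)
df_small <- data.frame(
  x    = c("x1", "x2"),
  col1 = c("http://youtube.com/something", "NA"),
  col2 = c("https://google.com", "https://www.yahooexample.com"),
  stringsAsFactors = FALSE
)

# Keep values that mention one of the keywords, turn everything else into NA
out <- df_small %>%
  mutate(across(col1:col2,
                ~ if_else(str_detect(.x, "google|youtube|bbc"), .x, NA_character_)))
out
```

Note that the literal string "NA" in the input is just text. It becomes a real NA only because it does not match the pattern.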
Answer 1 (score: 2)
You can use lapply with replace:
cols <- c("col1","col2","col3","col4")
df_in[,cols] <- lapply(df_in[,cols],
function(x) replace(x, !grepl("google|youtube|bbc",x ), NA))
df_in
# x col1 col2 col3 col4
#1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321 <NA>
#2 x2 <NA> http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
#3 x3 <NA> <NA> <NA> <NA>
#4 x4 <NA> https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
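This searches for the pattern anywhere in the string. If you want to make sure that only the domain name matches "google|youtube|bbc", you can change the grepl call to: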
grepl("(google|youtube|bbc)\\.com", test_string)  # "\\." matches a literal dot; a bare "." matches any character
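To see how the two patterns differ, here is a small base-R sketch. The third URL is a made-up example that is not in the question, and the dot is escaped so that it only matches a literal ".":

```r
urls <- c("http://youtube.com/something",
          "https://www.yahooexample.com",
          "https://example.com/search?q=google",  # hypothetical URL, not in the question
          "https://bbcnews.com/search")

# Keyword anywhere in the string
anywhere <- grepl("google|youtube|bbc", urls)

# Keyword immediately followed by ".com"
domain_only <- grepl("(google|youtube|bbc)\\.com", urls)

data.frame(urls, anywhere, domain_only)
```

The stricter pattern drops the URL that only mentions "google" in its query string, but it also drops "bbcnews.com", because "bbc" is not directly followed by ".com" there. Whether that is acceptable depends on which domains you expect.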
Answer 2 (score: 2)
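URLs can be awkward to parse. I suggest using the urltools library to parse them, and then grepl to look for the domains of interest (assuming you are interested in those words being found in the domain), i.e.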
library(urltools)
#Extract the domain
domain(df_in$col1)
#[1] "youtube.com" "na" "www.yahooexample.com" "www.yahooexample2.com"
To apply this to your problem:
# Skip the first column (x) so that the row labels are not overwritten with NA
df_in[-1] <- lapply(df_in[-1], function(i) replace(i, !grepl('google|youtube|bbc', domain(i)), NA))
df_in
x col1 col2 col3 col4
1 x1 http://youtube.com/something https://google.com http://www.bbcnews.com?id=321 <NA>
2 x2 <NA> http://www.bbcnews2.com?id=321 http://google.com?id=1234 https://www.youtube/com
3 x3 <NA> <NA> <NA> <NA>
4 x4 <NA> https://google.com/text https://bbcnews.com/search www.youtube.com/searcht
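If installing urltools is not an option, the same "match only against the host" idea can be roughly approximated in base R. This is a simplified sketch, not a replacement for a real URL parser: `extract_host` is a hypothetical helper, and the last URL is a made-up example that is not in the question.

```r
# Base-R stand-in for a domain extractor: strip an optional scheme, then keep
# everything up to the first "/", "?", "#" or ":" (a deliberate simplification).
extract_host <- function(url) {
  sub("^(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?([^/?#:]+).*$", "\\1", url, perl = TRUE)
}

urls <- c("http://youtube.com/something",
          "https://www.youtube/com",
          "http://www.bbcnews.com?id=321",
          "https://example.com/search?q=google")  # hypothetical URL, not in the question

hosts <- extract_host(urls)

# Matching on the whole URL versus matching on the host part only
keep_anywhere <- grepl("google|youtube|bbc", urls)
keep_host     <- grepl("google|youtube|bbc", hosts)

data.frame(urls, hosts, keep_anywhere, keep_host)
```

Only the last URL differs between the two checks: "google" appears in its query string but not in its host.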
Answer 3 (score: 1)
Use a combination of sapply() and grep() to change the elements that do not fit your desired pattern to NA.
df_in[, 2:5] <- sapply(
  X = df_in[, 2:5],
  FUN = function(i) {
    # positions of the elements that meet the condition
    condition <- grep(pattern = "youtube|bbc|google", x = i)
    # replace all other elements with NA; setdiff() also handles the case
    # where nothing matches, in which i[-condition] would replace nothing
    i[setdiff(seq_along(i), condition)] <- NA
    # return the modified column to sapply()
    return(i)
  }
)
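One edge case is worth knowing when grep() is combined with negative indexing: if nothing matches, grep() returns integer(0), and indexing with -integer(0) selects no elements at all, so nothing is replaced. A short base-R sketch with made-up values:

```r
x <- c("https://www.yahooexample.com", "NA")

hits <- grep("google|youtube|bbc", x)  # integer(0): nothing matches

# Negative indexing with an empty index selects nothing, so y stays unchanged
y <- x
y[-hits] <- NA

# A logical mask from grepl() handles the no-match case as intended
z <- x
z[!grepl("google|youtube|bbc", x)] <- NA

identical(y, x)  # TRUE
all(is.na(z))    # TRUE
```

The example data in the question has at least one match in every column, so the problem does not show up there, but it would for a column without any matching URL.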