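My data frame contains email addresses and website domains, and I want to separate the rows where the email address matches the domain from the rows where it does not.
Say I have a df: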
email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com' , 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- as.data.frame(cbind(email,website))
which gives:
> df
email website
1 abc@kjf.com http://www.kjf.com
2 jkl@def.com http://www.kjf.com
3 ghi@kjf.com http://www.kjf.com
4 def@kjf.com http://www.kjf.com
5 mno@asdf.com http://www.asdf.com
I want to dynamically create two data frames. One holding the rows where the email domain matches the website domain, like:
> df2
email website
1 abc@kjf.com http://www.kjf.com
2 ghi@kjf.com http://www.kjf.com
3 def@kjf.com http://www.kjf.com
4 mno@asdf.com http://www.asdf.com
and one holding the rows that do not match, like:
> df3
email website
1 jkl@def.com http://www.kjf.com
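I think I should use a regex, but I am not sure. Does anyone see how this could be done? Thanks
Answer 0 (score: 3)
This gives you a logical vector that is TRUE where the two domains differ, which you can use to filter the rows: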
gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)
# [1] FALSE TRUE FALSE FALSE FALSE
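To get the two data frames the question asks for, the logical vector can be used to index the rows directly. This is a minimal sketch in base R that rebuilds the sample data from the question; the name `mismatch` is only illustrative.

```r
# Sample data from the question
email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com', 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com',
             'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(email = email, website = website, stringsAsFactors = FALSE)

# TRUE where the email domain differs from the website domain
mismatch <- gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)

df2 <- df[!mismatch, ]  # rows whose domains match
df3 <- df[mismatch, ]   # rows whose domains do not match

# Optional: renumber the rows so they start at 1 in each data frame
rownames(df2) <- NULL
rownames(df3) <- NULL
```

With the sample data, `df2` has four rows and `df3` has the single `jkl@def.com` row.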
Breakdown:
gsub('.*@', '', df$email)
### .* zero or more characters, followed by
### @ a literal at sign
# [1] "kjf.com" "def.com" "kjf.com" "kjf.com" "asdf.com"
and for the URL:
gsub('https?://(www\\.)?', '', df$website)
### http literal string 'http'
### s? followed by an optional 's' (zero or one instance)
### :// literal string '://'
### (www\\.)? followed by an optional 'www.' (zero or one instance)
# [1] "kjf.com" "kjf.com" "kjf.com" "kjf.com" "asdf.com"
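The patterns above assume every URL is just a scheme plus a host, as in the sample data. If the real data may contain paths, ports, or mixed case, a slightly stricter extraction could look like the sketch below. The helpers `url_domain` and `email_domain` are hypothetical names, not part of any package, and URLs with embedded credentials (`user:pass@host`) are not handled.

```r
# Hypothetical helper: reduce a URL to its bare, lower-case domain
url_domain <- function(url) {
  host <- sub('^https?://', '', url, ignore.case = TRUE)  # drop the scheme
  host <- sub('[/:?#].*$', '', host)                       # drop port, path, query, fragment
  host <- sub('^www\\.', '', host, ignore.case = TRUE)     # drop a leading 'www.'
  tolower(host)
}

# Hypothetical helper: everything after the last '@', lower-cased
email_domain <- function(email) tolower(sub('.*@', '', email))

url_domain(c('http://www.kjf.com', 'https://KJF.com/about?x=1', 'http://www.asdf.com:8080/'))
# [1] "kjf.com"  "kjf.com"  "asdf.com"
```

The comparison then becomes `email_domain(df$email) != url_domain(df$website)`.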
Answer 1 (score: 1)
You can create a column that flags whether the email domain and the website domain are the same:
library(tidyverse)
email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com' , 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
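df <- data.frame(
  email = email,
  website = website
)
# Flag rows where the text after '@' equals the text after 'www.'
df <- df %>% mutate(
  same = (email %>% str_sub(
    start = str_locate(., '@')[, 'end'] + 1,
    end = -1L)) ==
    (website %>% str_sub(
      # fixed() matches 'www.' literally; as a regex, '.' would match any character
      start = str_locate(., fixed('www.'))[, 'end'] + 1,
      end = -1L))
)
# Rows whose domains match
df2 <- df %>% filter(same) %>% select(-same)
# Rows whose domains do not match
df3 <- df %>% filter(!same) %>% select(-same)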
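One caveat with locating `'www.'`: when a website has no `www.` prefix, `str_locate` returns `NA`, the `same` flag becomes `NA`, and `filter` drops that row from both results. A possible variation, sketched here with `str_remove` and a made-up third row whose URL lacks `www.`, strips the prefixes instead of locating them:

```r
library(dplyr)
library(stringr)

# Illustrative data: the last website has no 'www.' prefix (not from the question)
df <- data.frame(
  email   = c('abc@kjf.com', 'jkl@def.com', 'mno@asdf.com'),
  website = c('http://www.kjf.com', 'http://www.kjf.com', 'https://asdf.com'),
  stringsAsFactors = FALSE
)

df <- df %>% mutate(
  # Strip everything up to '@', and the scheme plus an optional 'www.'
  same = str_remove(email, '.*@') ==
    str_remove(website, '^https?://(www\\.)?')
)

df2 <- df %>% filter(same) %>% select(-same)   # matching rows
df3 <- df %>% filter(!same) %>% select(-same)  # non-matching rows
```

Here `df2` keeps the `abc@kjf.com` and `mno@asdf.com` rows, and `df3` keeps `jkl@def.com`.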