Compare strings in R and create data frames

Time: 2019-01-11 02:00:30

Tags: r

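My data frame contains email addresses and websites. I want to separate the rows where the email domain matches the website domain from the rows where it does not.

Say I have a df: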

email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com' , 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- as.data.frame(cbind(email,website))

Which results in:

> df
        email            website
1 abc@kjf.com http://www.kjf.com
2 jkl@def.com http://www.kjf.com
3 ghi@kjf.com http://www.kjf.com
4 def@kjf.com http://www.kjf.com
5 mno@asdf.com http://www.asdf.com

I want to dynamically create two data frames. One holds the rows where the email domain matches the website domain, e.g.:

> df2
        email            website
1 abc@kjf.com http://www.kjf.com
2 ghi@kjf.com http://www.kjf.com
3 def@kjf.com http://www.kjf.com
4 mno@asdf.com http://www.asdf.com

and one holds the rows that do not match, e.g.:

> df3
        email            website
1 jkl@def.com http://www.kjf.com

I think I should use regex, but I'm not sure. Does anyone see how this could be done? Thanks

2 Answers:

Answer 0: (score: 3)

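You can filter the rows with this comparison, which is TRUE where the email domain differs from the website domain: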

gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)
# [1] FALSE  TRUE FALSE FALSE FALSE

Breakdown:

gsub('.*@', '', df$email)
###   .*   zero or more characters, followed by
###     @  a literal at sign ('@')
# [1] "kjf.com"  "def.com"  "kjf.com"  "kjf.com"  "asdf.com"

and for the URL:

gsub('https?://(www\\.)?', '', df$website)
###   http                literal string 'http'
###       s?              with exactly zero or one instance of 's'
###         ://           literal string '://'
###            (www\\.)?  with exactly zero or one instance of 'www.'
# [1] "kjf.com"  "kjf.com"  "kjf.com"  "kjf.com"  "asdf.com"
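To get from that logical vector to the two data frames the question asks for, one option is plain logical indexing. This is a minimal sketch, not part of the original answer: the name `mismatch` is introduced here for illustration, and `stringsAsFactors = FALSE` is assumed so the columns are character vectors regardless of R version.

```r
email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com', 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com',
             'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(email, website, stringsAsFactors = FALSE)

# TRUE where the email domain differs from the website domain
# ('mismatch' is a hypothetical name, not from the answer)
mismatch <- gsub('.*@', '', df$email) != gsub('https?://(www\\.)?', '', df$website)

df2 <- df[!mismatch, ]  # rows whose domains match
df3 <- df[mismatch, ]   # rows whose domains do not match

# Renumber the rows so they start at 1, as in the desired output
rownames(df2) <- NULL
rownames(df3) <- NULL
```

Note that this comparison assumes the website is a bare host; a URL with a path or port after the host would need additional stripping before it could match.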

Answer 1: (score: 1)

You can create a column that flags whether the email domain and the website domain are the same:

library(tidyverse)

email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com' , 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(
  email = email,
  website = website,
  stringsAsFactors = FALSE  # keep both columns as character, not factor
)

df <- df %>% mutate(
  same = (email %>% str_sub(
    start = str_locate(., '@')[,'end'] + 1,
    end = -1L)) ==
    (website %>% str_sub(
      start = str_locate(., fixed('www.'))[,'end'] + 1,
      end = -1L))
)

df2 <- df %>% filter(
  same
) %>% select(
  -same
)

df3 <- df %>% filter(
  !same
) %>% select(
  -same
)
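As an alternative to filtering twice, the same match flag can drive base R's `split()`, which returns both data frames in one named list. This is only a sketch under assumptions: the names `parts`, `match`, and `no_match` are made up for illustration, and the domains are extracted with `sub()` rather than the `str_sub()` calls above.

```r
email <- c('abc@kjf.com', 'jkl@def.com', 'ghi@kjf.com', 'def@kjf.com', 'mno@asdf.com')
website <- c('http://www.kjf.com', 'http://www.kjf.com', 'http://www.kjf.com',
             'http://www.kjf.com', 'http://www.asdf.com')
df <- data.frame(email, website, stringsAsFactors = FALSE)

# TRUE where the email domain equals the website domain
same <- sub('.*@', '', df$email) == sub('^https?://(www\\.)?', '', df$website)

# split() returns a list with one data frame per factor level;
# the labels 'match' and 'no_match' are hypothetical names
parts <- split(df, factor(same, levels = c(TRUE, FALSE),
                          labels = c('match', 'no_match')))

df2 <- parts$match     # rows whose domains match
df3 <- parts$no_match  # rows whose domains do not match
```

Fixing the factor levels up front means both list elements exist even when every row falls on one side, in which case the other element is a zero-row data frame.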