How do I match and replace strings in a data frame in R?

Time: 2019-03-28 17:44:32

Tags: r dataframe replace

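I have a main data frame containing many websites that are in use, and a second data frame containing a list of bad websites. I want to match the two and identify whether any bad websites appear in my main data frame. I am very new to this, so I am not sure how to replace the bad websites with "www.badwebsite.com". Thanks.

Here is a sample of the data frames: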

site_list <- data.frame("host" = c("www.companya.com", "www.companyb.com", "www.malwaresite.com",
                                   "www.companyc.com", "www.companyd.com", "www.virussite.com",
                                   "www.companye.com", "www.companyf.com", "www.phishingsite.com"),
                        "URL" = c("www.companya.com/home", "www.companyb.com/home", "www.malwaresite.com/home",
                                  "www.companyc.com/home", "www.companyd.com/home", "www.virussite.com/home",
                                  "www.companye.com/home", "www.companyf.com/home", "www.phishingsite.com/home"))

bad_site_list <- data.frame("host" = c("www.malwaresite.com", "www.virussite.com", "www.phishingsite.com"))

I would like to achieve this result:

host                                  URL
www.companya.com               www.companya.com/home
www.companyb.com               www.companyb.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companyc.com               www.companyc.com/home
www.companyd.com               www.companyd.com/home
www.badwebsite.com             www.badwebsite.com/home
www.companye.com               www.companye.com/home
www.companyf.com               www.companyf.com/home
www.badwebsite.com             www.badwebsite.com/home

3 answers:

Answer 0: (score: 1)

For your simple example I would do it the following way, though it may not be the best option for more complex tables:

apply(site_list, 2, function(x)gsub(paste(bad_site_list$host, collapse="|"), "www.badwebsite.com", x))

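In apply, "2" means the function is applied to each column ("1" would apply it to each row).
The function looks for every host from bad_site_list and replaces it with www.badwebsite.com (using gsub).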
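
One thing to be aware of: when apply() is given a data frame it works on a matrix, so the result of the call above is a character matrix rather than a data frame. A minimal sketch of converting it back, using a shortened two-row version of the sample data (the names pattern, replaced and cleaned are only illustrative):

```r
# Shortened version of the sample data from the question
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home"),
  stringsAsFactors = FALSE
)
bad_site_list <- data.frame(host = "www.malwaresite.com",
                            stringsAsFactors = FALSE)

# Build one alternation pattern from all bad hosts
pattern <- paste(bad_site_list$host, collapse = "|")

# apply() over columns returns a character matrix here
replaced <- apply(site_list, 2, function(x) gsub(pattern, "www.badwebsite.com", x))
is.matrix(replaced)

# Convert the matrix back into a data frame with character columns
cleaned <- as.data.frame(replaced, stringsAsFactors = FALSE)
cleaned
```

If you need a data frame afterwards, either wrap the result as shown or use lapply() with site_list[] <- ... so the data frame structure is kept.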

Answer 1: (score: 1)

Without regular expressions, you can do this:

# Converting factor columns to character
site_list[] <- lapply(site_list, as.character)
bad_site_list[] <- lapply(bad_site_list, as.character)

# If you want to replace all the bad sites with "www.badwebsite.com" you can:
site_list$URL[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com/home"
site_list$host[site_list$host %in% bad_site_list$host] <- "www.badwebsite.com"

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
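
The assignment above hardcodes "/home" for every flagged URL, which works for the sample data. If your real URLs have different paths, a possible variant (a sketch only, assuming each URL contains its own host as a literal substring; the example URLs and the names bad_hosts and is_bad are made up for illustration) replaces just the host part and keeps the rest of the URL:

```r
# Illustrative data with different paths (not from the question)
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home",
           "www.malwaresite.com/login?id=1",
           "www.virussite.com/downloads/file.exe"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Flag rows whose host is on the bad list (exact match, no regex)
is_bad <- site_list$host %in% bad_hosts

if (any(is_bad)) {
  # Replace the host literally inside each flagged URL, keeping the path
  site_list$URL[is_bad] <- mapply(
    function(url, host) sub(host, "www.badwebsite.com", url, fixed = TRUE),
    site_list$URL[is_bad], site_list$host[is_bad],
    USE.NAMES = FALSE
  )
  # Update the host column last, since the URL step still needs the old hosts
  site_list$host[is_bad] <- "www.badwebsite.com"
}

site_list
```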

With regular expressions, you can do this:

# Using regex you could create a pattern 
bad_site_pattern <- paste(bad_site_list$host, collapse = "|")

# Then replace all instances in the dataframe using lapply
site_list[] <- lapply(site_list, gsub, pattern = bad_site_pattern, replacement = "www.badwebsite.com")

site_list
                host                     URL
1   www.companya.com   www.companya.com/home
2   www.companyb.com   www.companyb.com/home
3 www.badwebsite.com www.badwebsite.com/home
4   www.companyc.com   www.companyc.com/home
5   www.companyd.com   www.companyd.com/home
6 www.badwebsite.com www.badwebsite.com/home
7   www.companye.com   www.companye.com/home
8   www.companyf.com   www.companyf.com/home
9 www.badwebsite.com www.badwebsite.com/home
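
A caveat with the regex approach: in a regular expression "." matches any character, so a pattern built directly from the host names can also match look-alike strings. For the sample data this makes no difference, but if it matters for your list, one option is to escape the dots before collapsing. A small sketch (bad_hosts, escaped and lookalike are illustrative names, and only dots are escaped here, not other regex metacharacters):

```r
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Pattern built as-is: every "." matches any single character
loose_pattern <- paste(bad_hosts, collapse = "|")

# Escape each literal dot as "\." before building the pattern
escaped <- gsub("\\.", "\\\\.", bad_hosts)
strict_pattern <- paste(escaped, collapse = "|")

# A string that is not on the bad list but differs only where the dot is
lookalike <- "wwwxvirussite.com"

grepl(loose_pattern, lookalike)   # matches, because "." matched "x"
grepl(strict_pattern, lookalike)  # does not match
grepl(strict_pattern, "www.virussite.com/home")  # real bad host still matches
```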

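Answer 2: (score: 0)

Load the stringr library:

library(stringr)

Search for a string in a vector (for example, one column of your data frame):

str_detect(dataframe_name$column_name, "string_your_searching_for")

Replace a string in a vector:

str_replace(dataframe_name$column_name, "old_string", "new_string")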
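
Applied to the data from the question, this could look roughly like the sketch below. It assumes the stringr package is installed, uses a shortened version of the sample data, and wraps each host in fixed() so it is matched literally rather than as a regular expression; the loop and the name bad_hosts are illustrative choices, not the only way to do it.

```r
library(stringr)

# Shortened version of the sample data from the question
site_list <- data.frame(
  host = c("www.companya.com", "www.malwaresite.com", "www.virussite.com"),
  URL  = c("www.companya.com/home", "www.malwaresite.com/home",
           "www.virussite.com/home"),
  stringsAsFactors = FALSE
)
bad_hosts <- c("www.malwaresite.com", "www.virussite.com")

# Which rows mention the first bad host? (logical vector, one entry per row)
str_detect(site_list$host, fixed(bad_hosts[1]))

# Replace every bad host, one at a time, in both columns
for (bad in bad_hosts) {
  site_list$host <- str_replace_all(site_list$host, fixed(bad), "www.badwebsite.com")
  site_list$URL  <- str_replace_all(site_list$URL,  fixed(bad), "www.badwebsite.com")
}

site_list
```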