I have a data frame containing Name1 (10 observations) and Name2, which contains 3 observations. I have the following toy example:
Name1                         Name2
Acadian Hospitals             Wellington
Bridgewater Trust Associates  Zeus
Concordia Consulting          Acadian
Wellington Corporation LLC    .
Wellington Wealth Management  .
Prime Acadian Charity
If a value in Name2 matches part of a string in Name1, I want the output in column3 to be TRUE.
Currently, my code only works using pmatch.
My final output should look like this:
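Answer 0 (score: 3)
It sounds like Name2 is really just a set of lookup values. In that case, you can build the lookup pattern by pasting all the values together with "|", then run a single grepl call that searches df$Name1 for all of the df$Name2 values at once: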
# Keep only the real lookup values (everything that is not the "." placeholder)
df$Is_Matched <- grepl(paste(df$Name2[df$Name2 != "."], collapse = "|"), df$Name1)
#                          Name1      Name2 Is_Matched
#1            Acadian Hospitals Wellington       TRUE
#2 Bridgewater Trust Associates       Zeus      FALSE
#3         Concordia Consulting    Acadian      FALSE
#4   Wellington Corporation LLC          .       TRUE
#5 Wellington Wealth Management          .       TRUE
#6        Prime Acadian Charity          .       TRUE
Note that this assumes missing values in Name2 are coded as "." rather than NA. Adapting it to any other encoding of missing values would be easy.
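For the case where the missing values are stored as NA instead, a minimal sketch could look like the following. The helper name build_pattern and its missing_code argument are hypothetical, added here for illustration only. The escaping step is also an addition, and it only matters if a lookup value contains regex metacharacters such as "." or "(".

```r
# Toy data from the question, with missing lookups stored as NA (assumption)
df <- data.frame(
  Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
            "Concordia Consulting", "Wellington Corporation LLC",
            "Wellington Wealth Management", "Prime Acadian Charity"),
  Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop missing values, escape regex metacharacters,
# and join what is left into one alternation pattern
build_pattern <- function(x, missing_code = NA) {
  keep <- x[!is.na(x) & !(x %in% missing_code)]
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", keep)
  paste(escaped, collapse = "|")
}

pattern <- build_pattern(df$Name2)
df$Is_Matched <- grepl(pattern, df$Name1)
```

Calling build_pattern(df$Name2, missing_code = ".") would cover the "." encoding with the same code. If every value is missing, the pattern is an empty string and grepl matches every row, so that case would need its own guard.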
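Answer 1 (score: 2)
You can use sapply. Without an example, I think something like this should work. I'll check it against an example in a few seconds.
df$Is_Matched <- sapply(df$Name2, function(x) any(grepl(x, df$Name1)))
Edit:
Creating an example data frame helped. sapply returns a matrix in which each word in Name2 gets its own column. You can therefore use rowSums (TRUE counts as 1, FALSE as 0) to test whether each row contains at least one TRUE. Let me know if you have any questions.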
> df <- data.frame(
+ Name1 = c("Acadian Hospitals", "Bridgewater Trust Associates",
+ "Concordia Consulting", "Wellington Corporation LLC",
+ "Wellington Wealth Management", "Prime Acadian Charity"),
+ Name2 = c("Wellington", "Zeus", "Acadian", NA, NA, NA),
+ stringsAsFactors = FALSE
+ )
>
> match_me <- na.omit(df$Name2)
> df$Is_Matched <- rowSums(sapply(match_me, function(x) grepl(x, df$Name1))) > 0
> df
                         Name1      Name2 Is_Matched
1            Acadian Hospitals Wellington       TRUE
2 Bridgewater Trust Associates       Zeus      FALSE
3         Concordia Consulting    Acadian      FALSE
4   Wellington Corporation LLC       <NA>       TRUE
5 Wellington Wealth Management       <NA>       TRUE
6        Prime Acadian Charity       <NA>       TRUE
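To make the intermediate step visible, the sketch below rebuilds the logical matrix that sapply returns and checks the rowSums test against a row-wise any(). The variable names hits, by_rowsums and by_any are illustrative and do not come from the answer.

```r
Name1 <- c("Acadian Hospitals", "Bridgewater Trust Associates",
           "Concordia Consulting", "Wellington Corporation LLC",
           "Wellington Wealth Management", "Prime Acadian Charity")
match_me <- c("Wellington", "Zeus", "Acadian")

# One row per Name1 entry, one column per lookup value
hits <- sapply(match_me, function(x) grepl(x, Name1))
dim(hits)  # 6 rows, 3 columns

# TRUE counts as 1 and FALSE as 0, so a positive row sum means at least one hit
by_rowsums <- rowSums(hits) > 0

# Equivalent formulation: ask directly whether any value in the row is TRUE
by_any <- apply(hits, 1, any)
```

Both vectors flag the same rows. The rowSums form is vectorised, while apply with any states the intent more literally.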
Answer 2 (score: 2)
With help from Mike H.:
Name1 = c("Bridgewater Trust Associates", "Acadian Wealth Management", "Wellington Wealth Trust", "Concordia University", "Southern Zeus College", "Parametric Modeling", "Wellington City Corporation", "Hotel Zanzibar")
Name2 = c("Acadian", "Wellington", "Zeus")
max.len = max(length(Name1), length(Name2))
Name1 = c(Name1, rep(NA, max.len - length(Name1)))
Name2 = c(Name2, rep(NA, max.len - length(Name2)))
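# Drop the NA padding first: paste() would otherwise turn each NA into the literal string "NA"
column3 <- grepl(paste(na.omit(Name2), collapse = "|"), Name1)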
df <- data.frame(Name1, Name2, column3, stringsAsFactors = FALSE)