I have two data frames and I am trying to find the strings they have in common.
df1 <- structure(list(split = structure(c(7L, 6L, 8L, 3L, 2L, 4L, 9L,
4L, 9L, 5L, 10L, 1L), .Label = c("America1", "corea", "coreanorth1",
"gdyijq", "gqdtr", "india-2", "india1", "india3", "udyhfs", "USA"
), class = "factor"), count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L,
4L, 4L, 5L, 5L)), .Names = c("split", "count"), row.names = c(NA,
-12L), class = "data.frame")
It looks like this:
split count
india1 1
india-2 1
india3 1
coreanorth1 2
corea 2
gdyijq 3
udyhfs 3
gdyijq 4
udyhfs 4
gqdtr 4
USA 5
America1 5
I have another data frame with the same structure:
df2<- structure(list(split = structure(c(3L, 2L, 1L), .Label = c("America1",
"gdyijq", "india1"), class = "factor"), count = 1:3), .Names = c("split",
"count"), class = "data.frame", row.names = c(NA, -3L))
split count
india1 1
gdyijq 2
America1 3
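I want to check whether any string in df2 also exists in df1 and, if so, output the two counts separated by a comma. For example,
india1 is in df2 and matches india1 in df1, so the output is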
india1 1,1
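If a string appears more than once in df1, separate each occurrence with a semicolon, as for gdyijq.
The output should look like this: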
india1 1,1
gdyijq 2,3;2,4
America1 3,5
Answer 0: (score: 3)
You want a merge or a join, for example inner_join from dplyr:
library(dplyr)
(DF <- inner_join(df1, df2, by = "split"))
Now we have to combine all entries that share the same split value:
DF %>%
group_by(split) %>%
summarize(counts = paste0(count.x, ",", count.y, collapse = ";"))
Result:
# A tibble: 3 × 2
split counts
<chr> <chr>
1 America1 5,3
2 gdyijq 3,2;4,2
3 india1 1,1
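Note that the result above lists the df1 count first (for example 5,3), while the question asks for the df2 count first (3,5). A minimal sketch of one way to get that order is to put df2 on the left side of the join. The data is rebuilt here with character columns, which is an assumption made for this example to avoid joining factors with different levels; res_dplyr is a name introduced only for illustration.

```r
library(dplyr)

# Same values as in the question, rebuilt with character columns
# (an assumption of this sketch, to avoid factor-level mismatches).
df1 <- data.frame(
  split = c("india1", "india-2", "india3", "coreanorth1", "corea", "gdyijq",
            "udyhfs", "gdyijq", "udyhfs", "gqdtr", "USA", "America1"),
  count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  split = c("india1", "gdyijq", "America1"),
  count = 1:3,
  stringsAsFactors = FALSE
)

# With df2 on the left, count.x comes from df2 and count.y from df1,
# so each pair is written as "<df2 count>,<df1 count>".
res_dplyr <- inner_join(df2, df1, by = "split") %>%
  group_by(split) %>%
  summarise(counts = paste0(count.x, ",", count.y, collapse = ";"))

print(res_dplyr)
```

Only the argument order of inner_join changes; the grouping and pasting step is the same as in the answer.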
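Answer 1: (score: 2)
This does not give you the exact output you asked for, but it lists every match, along with the count from each data frame, in a single data frame: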
z = merge(df1,df2,by = "split")
Result:
> z
split count.x count.y
1 America1 5 3
4 gdyijq 4 2
5 gdyijq 3 2
8 india1 1 1
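To get from the merged data frame to the comma- and semicolon-separated format, one possible base R sketch is shown below. It merges with df2 first so that count.x is the df2 count, sorts explicitly so the order of pairs within a split does not depend on how merge happens to arrange ties, and then collapses the pairs per split with tapply. The character columns and the names pair, counts and res_base are assumptions of this example, not part of the original answer.

```r
# Same values as in the question, rebuilt with character columns
# (an assumption of this sketch).
df1 <- data.frame(
  split = c("india1", "india-2", "india3", "coreanorth1", "corea", "gdyijq",
            "udyhfs", "gdyijq", "udyhfs", "gqdtr", "USA", "America1"),
  count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  split = c("india1", "gdyijq", "America1"),
  count = 1:3,
  stringsAsFactors = FALSE
)

# count.x is the df2 count, count.y is the df1 count.
z <- merge(df2, df1, by = "split")

# Make the order of rows within each split explicit.
z <- z[order(z$split, z$count.y), ]

# Build "<df2 count>,<df1 count>" for every matched row.
z$pair <- paste(z$count.x, z$count.y, sep = ",")

# Collapse all pairs that belong to the same split with ";".
counts <- tapply(z$pair, z$split, paste, collapse = ";")
res_base <- data.frame(split = names(counts),
                       counts = as.vector(counts),
                       stringsAsFactors = FALSE)

print(res_base)
```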
Answer 2: (score: 2)
Here is a possible data.table version:
library(data.table)
# convert to data.table
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)
# set keys for use in matching
setkey(df1, split)
setkey(df2, split)
# chain operations:
# join df1 to df2 on the key (count: count from df1, i.count: count from df2);
# then, grouping by split, paste each pair of counts with ","
# and collapse the pairs of one split with ";"
df1[df2][ , .(counts = paste(i.count, count, sep = ",", collapse = ";")), by = split]
The output looks like this:
split counts
1: America1 3,5
2: gdyijq 2,3;2,4
3: india1 1,1
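If you would rather not set keys (setkey reorders the tables by reference), the same join can be sketched as an ad hoc join with the on argument. This is only a sketch under a few assumptions: the tables are rebuilt with character columns, nomatch = NULL (available in recent data.table versions) is used to drop df2 rows without a match, and dt1, dt2 and res_dt are names introduced for this example.

```r
library(data.table)

# Same values as in the question, built directly as data.tables with
# character columns (an assumption of this sketch).
dt1 <- data.table(
  split = c("india1", "india-2", "india3", "coreanorth1", "corea", "gdyijq",
            "udyhfs", "gdyijq", "udyhfs", "gqdtr", "USA", "America1"),
  count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L)
)
dt2 <- data.table(
  split = c("india1", "gdyijq", "America1"),
  count = 1:3
)

# Ad hoc join on split without keys: in the joined table, count is the
# dt1 count and i.count is the dt2 count. nomatch = NULL keeps only rows
# of dt2 that have a match in dt1 (an inner join).
res_dt <- dt1[dt2, on = "split", nomatch = NULL][
  , .(counts = paste(i.count, count, sep = ",", collapse = ";")), by = split]

print(res_dt)
```

Without keys, the rows follow the order of dt2 rather than the sorted order of split.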