Matching strings in two data frames, with their counts

Asked: 2017-04-21 13:47:52

Tags: r

I have two data frames and I am trying to find the strings that appear in both of them.

df1 <- structure(list(split = structure(c(7L, 6L, 8L, 3L, 2L, 4L, 9L, 
4L, 9L, 5L, 10L, 1L), .Label = c("America1", "corea", "coreanorth1", 
"gdyijq", "gqdtr", "india-2", "india1", "india3", "udyhfs", "USA"
), class = "factor"), count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 
4L, 4L, 5L, 5L)), .Names = c("split", "count"), row.names = c(NA, 
-12L), class = "data.frame")

It looks like this:

split       count
india1        1
india-2       1
india3        1
coreanorth1   2
corea         2
gdyijq        3
udyhfs        3
gdyijq        4
udyhfs        4
gqdtr         4
USA           5
America1      5

I have another data frame with the same structure:

df2<- structure(list(split = structure(c(3L, 2L, 1L), .Label = c("America1", 
"gdyijq", "india1"), class = "factor"), count = 1:3), .Names = c("split", 
"count"), class = "data.frame", row.names = c(NA, -3L))



    split    count
   india1     1
   gdyijq     2
 America1     3

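I want to check whether any string in df2 also exists in df1 and, if it does, output the count from df2 and the count from df1 separated by a comma. For example:

india1 is in df2 and matches india1 in df1, so the output is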

india1  1,1

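If a string appears more than once in df1, as gdyijq does, each pair is separated by a semicolon.

The output should look like this: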

india1     1,1
gdyijq     2,3;2,4
America1   3,5

3 Answers:

Answer 0: (score: 3)

You want a merge, or one of the joins from dplyr:

library(dplyr)
(DF <- inner_join(df1, df2, by = "split"))

Now we have to combine all the rows that share the same split value into a single entry:

DF %>%
  group_by(split) %>%
  summarize(counts = paste0(count.x, ",", count.y, collapse = ";"))

Result:

# A tibble: 3 × 2
     split  counts
     <chr>   <chr>
1 America1     5,3
2   gdyijq 3,2;4,2
3   india1     1,1
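
Note that the pairs above are written as df1 count, df2 count, whereas the question asks for the df2 count first. A minimal sketch of the same join with the order swapped is below. It assumes the example data is rebuilt with plain character columns instead of factors, and it uses the `suffix` argument of `inner_join` to name the two count columns; the names `count.df1`, `count.df2` and `res` are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Example data from the question, rebuilt with character columns (an assumption)
df1 <- data.frame(
  split = c("india1", "india-2", "india3", "coreanorth1", "corea", "gdyijq",
            "udyhfs", "gdyijq", "udyhfs", "gqdtr", "USA", "America1"),
  count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  split = c("india1", "gdyijq", "America1"),
  count = 1:3,
  stringsAsFactors = FALSE
)

# Join on split, then paste "df2 count,df1 count" for every matching row
# and collapse the pairs of each group with ";"
res <- inner_join(df1, df2, by = "split", suffix = c(".df1", ".df2")) %>%
  group_by(split) %>%
  summarise(counts = paste0(count.df2, ",", count.df1, collapse = ";"))

res
```

With the sample data this should give 3,5 for America1, 2,3;2,4 for gdyijq and 1,1 for india1.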

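Answer 1: (score: 2)

This won't give you the exact result, but it puts every match, together with the count from each data frame, into a single data frame: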

z = merge(df1,df2,by = "split")

Result:

> z

     split count.x count.y
1 America1       5       3
4   gdyijq       4       2
5   gdyijq       3       2
8   india1       1       1
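
To go from this merged data frame to the format requested in the question, one possible follow-up in base R is sketched below. It is not part of the original answer: the helper column `pair` and the result name `res` are made up for illustration, and the example data is rebuilt with character columns for brevity.

```r
# Example data from the question, rebuilt with character columns (an assumption)
df1 <- data.frame(
  split = c("india1", "india-2", "india3", "coreanorth1", "corea", "gdyijq",
            "udyhfs", "gdyijq", "udyhfs", "gqdtr", "USA", "America1"),
  count = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  split = c("india1", "gdyijq", "America1"),
  count = 1:3,
  stringsAsFactors = FALSE
)

# count.x comes from df1, count.y from df2
z <- merge(df1, df2, by = "split")

# Order the rows so that the pairs of one string are listed by ascending df1 count
z <- z[order(z$split, z$count.x), ]

# Build "df2 count,df1 count" for each matching row
z$pair <- paste(z$count.y, z$count.x, sep = ",")

# Collapse all pairs that belong to the same string with ";"
res <- aggregate(pair ~ split, data = z, FUN = paste, collapse = ";")
names(res)[2] <- "counts"

res
```

`aggregate` passes `collapse = ";"` through to `paste`, so each group ends up as a single string such as 2,3;2,4 for gdyijq.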

Answer 2: (score: 2)

Here is a possible data.table version:

library(data.table)

# convert to data.table
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# set keys for use in matching
setkey(df1, split)
setkey(df2, split)

# chain operations:
# df1[df2] looks up each row of df2 in df1 by the key (split);
# in the joined table, count is the count from df1 and
# i.count is the count from df2
# then paste each pair and collapse the pairs of one split with ";"
df1[df2][, .(counts = paste(i.count, count, sep = ",", collapse = ";")), by = split]

The output looks like this:

      split  counts
1: America1     3,5
2:   gdyijq 2,3;2,4
3:   india1     1,1