Count the number of string matches in a data frame in R

Time: 2014-12-13 15:38:44

Tags: r compare dataframe string-matching

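I have a data frame that looks like the one below. I want to compare book_id1 and book_id2 and count the matching strings, where each string is enclosed in double quotes and separated by commas.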

id1 id2 book_id1                      numberofbook_id1 book_id2          numberofbook_id2
 1   2  ["19167120","237494310","195166798"]    3      ["19167120","237494310"]   2
 1   3  ["19167120","237494310","195166798"]    3      []                         0
 2   3  ["19167120","237494310"]               2       []                         0

The output I want looks like this:

id1 id2 book_id1                     numberofbook_id1 book_id2          numberofbook_id2    count
 1   2  ["19167120","237494310","195166798"]    3      ["19167120","237494310"]   2            2
 1   3  ["19167120","237494310","195166798"]    3      []                         0            0
 2   3  ["19167120","237494310"]               2       []                         0            0

Thanks in advance.

1 answer:

Answer 0 (score: 0)

If you want the number of matching strings:

 library(stringr)
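 # Extract the numeric IDs from each column, intersect them row by row,
 # and count the IDs that the two columns share
 count <- sapply(Map(intersect, str_extract_all(df$book_id1, '\\d+'),
          str_extract_all(df$book_id2, '\\d+')), length)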
 count
 #[1] 2 0 0

 transform(df, count=count)
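
The same row-by-row intersection can also be sketched in base R, without stringr. This is only an illustrative sketch, not the answer's own code: `extract_ids` and `match_count` are hypothetical names, the sample vectors are copied from the question's data, and `lengths()` assumes R 3.2.0 or later.

```r
# Sample values copied from the question's book_id1 / book_id2 columns
book_id1 <- c('["19167120","237494310","195166798"]',
              '["19167120","237494310","195166798"]',
              '["19167120","237494310"]')
book_id2 <- c('["19167120","237494310"]', '[]', '[]')

# Hypothetical helper: pull every run of digits out of each string.
# An empty array "[]" yields character(0).
extract_ids <- function(x) regmatches(x, gregexpr("[0-9]+", x))

# Intersect the two ID sets row by row, then count the shared IDs
shared <- Map(intersect, extract_ids(book_id1), extract_ids(book_id2))
match_count <- lengths(shared)
match_count  # 2 0 0 for the sample rows
```

With the question's data frame, the result could be attached in the same way as above, for example with `transform(df, count = match_count)`.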

Or, if you only need the counts:

nchar(gsub('[^,]+', '',df$book_id1))+1
#[1] 3 3 2

# Entries in book_id2 = commas + 1, except for the empty array '[]'
commas <- nchar(gsub('[^,]+', '', df$book_id2))
transform(df, count = ifelse(df$book_id2 == '[]', 0, commas + 1))
#    id1 id2                             book_id1 numberofbook_id1
#1   1   2 ["19167120","237494310","195166798"]                3
#2   1   3 ["19167120","237494310","195166798"]                3
#3   2   3             ["19167120","237494310"]                2
#                   book_id2 numberofbook_id2 count
#1 ["19167120","237494310"]                2     2
#2                       []                0     0
#3                       []                0     0
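
Counting commas treats an empty array and a one-element array alike, since both contain zero commas. As a hedged alternative, the quoted entries themselves can be counted. The sketch below uses a hypothetical helper `count_ids` and sample vectors copied from the question; it assumes every ID is wrapped in double quotes, as in the sample data.

```r
# Hypothetical helper: count the double-quoted entries in each string
count_ids <- function(x) lengths(regmatches(x, gregexpr('"[^"]*"', x)))

# Sample values copied from the question's data
book_id1 <- c('["19167120","237494310","195166798"]',
              '["19167120","237494310","195166798"]',
              '["19167120","237494310"]')
book_id2 <- c('["19167120","237494310"]', '[]', '[]')

n1 <- count_ids(book_id1)  # 3 3 2, same as numberofbook_id1
n2 <- count_ids(book_id2)  # 2 0 0, same as numberofbook_id2

# A single-element array is counted as 1, not 0
count_ids('["19167120"]')
```

These counts can be used to cross-check the stored numberofbook_id1 and numberofbook_id2 columns.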

Data

df <- structure(list(id1 = c(1L, 1L, 2L), id2 = c(2L, 3L, 3L), book_id1 =
 c("[\"19167120\",\"237494310\",\"195166798\"]", 
"[\"19167120\",\"237494310\",\"195166798\"]", "[\"19167120\",\"237494310\"]"
), numberofbook_id1 = c(3L, 3L, 2L), book_id2 = c("[\"19167120\",\"237494310\"]", 
"[]", "[]"), numberofbook_id2 = c(2L, 0L, 0L)), .Names = c("id1", 
"id2", "book_id1", "numberofbook_id1", "book_id2", "numberofbook_id2"
 ), class = "data.frame", row.names = c(NA, -3L))