Convert a string to a similarity matrix

Asked: 2014-01-10 19:09:24

Tags: r regex string matrix

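I have some strings in a particular format that represent sets. In R, I would like to convert them to similarity matrices.

For example, a string saying that 1 and 2 form one set, 3 is alone in its own set, and 4, 5, and 6 form another set is: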

"1+2,3,4+5+6"

For the example above, I would like to be able to produce

      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    1    1    0    0    0    0
 [2,]    1    1    0    0    0    0
 [3,]    0    0    1    0    0    0
 [4,]    0    0    0    1    1    1
 [5,]    0    0    0    1    1    1
 [6,]    0    0    0    1    1    1

This seems like it should be a very simple task. How would I go about it?

3 Answers:

Answer 0: (score: 5)

Here's one way:

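# Split into sets at ",", then split each set into its numeric members at "+"
out <- lapply(unlist(strsplit("1+2,3,4+5+6", ",")), function(x) {
    as.numeric(unlist(strsplit(x, "\\+")))
})

# Cross-tabulate each element (rows) against the index of its set (columns)
x <- table(unlist(out), rep(seq_along(out), sapply(out, length)))

# x %*% t(x) is 1 where two elements share a set and 0 otherwise
matrix(x %*% t(x), nrow(x))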

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    0    0    0    0
## [2,]    1    1    0    0    0    0
## [3,]    0    0    1    0    0    0
## [4,]    0    0    0    1    1    1
## [5,]    0    0    0    1    1    1
## [6,]    0    0    0    1    1    1
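
To see why the matrix product works, it can help to build the intermediate table by hand. The sketch below follows the same idea with an explicit 0/1 membership matrix (rows are elements, columns are sets). The names `parse_sets` and `membership` are made up for illustration and are not part of the answer above, and the code assumes the elements are positive integers.

```r
# Hypothetical helper: split "1+2,3,4+5+6" into a list of numeric vectors
parse_sets <- function(string) {
  lapply(strsplit(strsplit(string, ",")[[1]], "\\+"), as.numeric)
}

sets <- parse_sets("1+2,3,4+5+6")

# membership[i, j] is 1 when element i belongs to set j
n <- max(unlist(sets))
membership <- matrix(0, nrow = n, ncol = length(sets))
for (j in seq_along(sets)) {
  membership[sets[[j]], j] <- 1
}

# Entry [a, b] of the product counts the sets shared by a and b,
# which is 0 or 1 as long as no element appears in more than one set
sim <- membership %*% t(membership)
```

Because this version sizes the matrix by the largest element rather than by the elements that actually occur, an element that is missing from the string keeps an all-zero row and column.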

Answer 1: (score: 2)

Pseudocode:

Split at , to get an array of strings, each describing a set.
For each element of the array:
    Split at + to get an array of set members
    Mark every possible pairing of members of this set on the matrix

You can create a matrix in R with:

m = mat.or.vec(6, 6)

By default, the matrix should be initialized with all entries set to 0. You can assign new values with something like:

m[2,3] = 1
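
The pseudocode above could be turned into R along these lines. This is only a sketch: the function name `set_matrix` is hypothetical, and it assumes the elements are positive integers so that they can be used directly as row and column indices.

```r
# Hypothetical function name; assumes elements are positive integers
set_matrix <- function(string) {
  # Split at "," to get one string per set
  groups <- strsplit(string, ",")[[1]]
  # Split each set at "+" to get its members
  members <- lapply(strsplit(groups, "\\+"), as.integer)

  # Size the matrix by the largest element that appears
  n <- max(unlist(members))
  m <- mat.or.vec(n, n)

  for (s in members) {
    # Mark every pairing of members of this set, including each with itself
    for (a in s) {
      for (b in s) {
        m[a, b] <- 1
      }
    }
  }
  m
}

set_matrix("1+2,3,4+5+6")
```

The two inner loops mirror the "mark every possible pairing" step literally; in practice they can be collapsed into the single vectorized assignment `m[s, s] <- 1`.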

Answer 2: (score: 1)

Here's another way:

# Write a simple function
similarity <- function(string){
  # Split into sets at ",", then into members at "+"
  sets <- strsplit(strsplit(string, ",")[[1]], "\\+")
  # Convert each set to numeric indices (surrounding spaces are tolerated)
  ind <- lapply(sets, as.numeric)
  # Size the matrix by the largest element, so multi-digit values work
  n <- max(unlist(ind))
  mat <- mat.or.vec(n, n)

  # Mark every pair of members within each set
  for(i in seq_along(ind)){
    mat[ind[[i]], ind[[i]]] <- 1
  }

  return(mat)

}

# Use that function
> similarity("1+2,3,4+5+6")
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    0    0    0    0
[2,]    1    1    0    0    0    0
[3,]    0    0    1    0    0    0
[4,]    0    0    0    1    1    1
[5,]    0    0    0    1    1    1
[6,]    0    0    0    1    1    1

# Try it on another string
> similarity("1+2,3,5+6+7, 8")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    1    0    0    0    0    0    0
[2,]    1    1    0    0    0    0    0    0
[3,]    0    0    1    0    0    0    0    0
[4,]    0    0    0    0    0    0    0    0
[5,]    0    0    0    0    1    1    1    0
[6,]    0    0    0    0    1    1    1    0
[7,]    0    0    0    0    1    1    1    0
[8,]    0    0    0    0    0    0    0    1