Splitting list elements to expand the list

Time: 2015-05-29 07:35:20

Tags: r list split ocr expand

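I am doing a kind of optical character recognition and have run into the following problem. I store the glyphs in a list of binary matrices. The matrices can have different sizes, but their maximum possible width is wid = 3 columns (it can be any defined constant, not only 3). In some cases, after the first processing stage, I get data that looks like this: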

myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
             matrix(c(0), ncol = 1),
             matrix(c(1, 1, 0), ncol = 3),
             matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
             matrix(c(1, 1, 1, 1), ncol = 2))
# [[1]]
#      [,1] [,2]
# [1,]    0    1
# [2,]    0    1
# [3,]    0    0
# 
# [[2]]
#      [,1]
# [1,]    0
# 
# [[3]]
#      [,1] [,2] [,3]
# [1,]    1    1    0
# 
# [[4]]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,]    1    1    1    0    0    0    1
# [2,]    0    1    0    1    0    0    1
# [3,]    1    1    1    1    0    0    1
# 
# [[5]]
#      [,1] [,2]
# [1,]    1    1
# [2,]    1    1

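So some glyphs may, for whatever reason, not have been separated. This happens only with glyphs of the maximum possible width. In addition, there can be some junk at the end of a matrix. I have to split such matrices into pieces of width ncol = wid, leaving the last piece (the junk) as it is. I then store each piece as a separate element of the list to get the following output: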

# [[1]]
#      [,1] [,2]
# [1,]    0    1
# [2,]    0    1
# [3,]    0    0
# 
# [[2]]
#      [,1]
# [1,]    0
# 
# [[3]]
#      [,1] [,2] [,3]
# [1,]    1    1    0
# 
# [[4]]
#      [,1] [,2] [,3]
# [1,]    1    1    1
# [2,]    0    1    0
# [3,]    1    1    1
# 
# [[5]]
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    0    0
# [3,]    1    0    0
# 
# [[6]]
#      [,1]
# [1,]    1
# [2,]    1
# [3,]    1
# 
# [[7]]
#      [,1] [,2]
# [1,]    1    1
# [2,]    1    1

Currently I can achieve this with the following functions:

checkGlyphs <- function(gl_m, wid = 3) {
  if (ncol(gl_m) > wid) {
    # drop = FALSE keeps one-row pieces as matrices, so ncol() still works on them
    return(list(gl_m[, 1:wid, drop = FALSE], gl_m[, -(1:wid), drop = FALSE]))
  }
  gl_m
}

library(magrittr)  # provides the %>% pipe used below

separateGlyphs <- function(myll, wid = 3) {
  # Split every matrix wider than wid into list(head, rest)
  presplit <- lapply(myll, checkGlyphs, wid)
  is_split <- unlist(lapply(presplit, is.list))
  # Final length: pieces of the split elements plus the untouched matrices
  total_new_length <-
    presplit[is_split] %>% lapply(length) %>% unlist() %>% sum() +
    sum(!is_split)

  # Copy all pieces into one flat, preallocated list
  splitted <- vector("list", length = total_new_length)
  spl_index <- 1
  for (i in seq_along(presplit)) {
    if (!is.list(presplit[[i]])) {
      splitted[[spl_index]] <- presplit[[i]]
      spl_index <- spl_index + 1
    } else {
      for (j in seq_along(presplit[[i]])) {
        splitted[[spl_index]] <- presplit[[i]][[j]]
        spl_index <- spl_index + 1
      }
    }
  }

  # Recurse while any piece is still wider than wid
  if (any(vapply(splitted, ncol, integer(1)) > wid)) {
    return(separateGlyphs(splitted, wid))
  }
  splitted
}

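But I believe there is a faster and more convenient way to achieve the same result (without for loops and this nested reassignment of elements, followed by recursion if needed O_o).

I would appreciate any suggestions on this, or recommendations of OCR packages for R.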

1 answer:

Answer 0: (score: 0)

This should do the trick; the values in final are what you are after.

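wid <- 3  # maximum glyph width, as in the question
# Each matrix is split on its own: the row counts differ, so cbind() over the whole list would fail
final <- do.call(c, lapply(myll, function(m) {
  idx <- seq(1, ncol(m), by = wid)  # first column of each chunk
  lapply(idx, function(x) m[, x:min(x + wid - 1, ncol(m)), drop = FALSE])
}))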
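
For comparison, the same splitting can be sketched by grouping each matrix's column indices into blocks of wid with split(), which avoids both the explicit loop and the recursion from the question. This is only a minimal sketch, not a tuned implementation: split_glyph and separate_glyphs are hypothetical helper names, and myll is the sample list from the question, repeated here so the snippet runs on its own.

```r
# Sample data from the question
myll <- list(matrix(c(0, 0, 0, 1, 1, 0), ncol = 2),
             matrix(c(0), ncol = 1),
             matrix(c(1, 1, 0), ncol = 3),
             matrix(c(1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1), ncol = 7),
             matrix(c(1, 1, 1, 1), ncol = 2))

# Hypothetical helper: cut one matrix into blocks of at most `wid` columns.
split_glyph <- function(m, wid = 3) {
  cols <- seq_len(ncol(m))
  # Group label 1 for columns 1..wid, 2 for the next wid columns, and so on
  groups <- split(cols, ceiling(cols / wid))
  # drop = FALSE keeps one-row and one-column pieces as matrices
  lapply(unname(groups), function(j) m[, j, drop = FALSE])
}

# Hypothetical wrapper: split every matrix, then flatten the result by one level.
separate_glyphs <- function(ll, wid = 3) {
  unlist(lapply(ll, split_glyph, wid = wid), recursive = FALSE)
}

res <- separate_glyphs(myll, wid = 3)
length(res)        # 7
sapply(res, ncol)  # 2 1 3 3 3 1 2
```

Because the last group simply contains whatever columns are left over, the trailing junk piece stays as it is, and a different maximum width only requires changing the wid argument.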