Creating a transition matrix for each unique element in a variable

Time: 2016-07-23 14:03:54

Tags: r database matrix data-manipulation

I'm having trouble creating a transition matrix. Below is the dataset I'm working with:

Name     Rating   ID   DATE(YYYYmmdd)   
@0CC        1   71476   20000704    
@0CC        1   71476   20001204    
@0RM        1   73565   20000919    
@0RM        2   49960   20000131    
@0RM        1   44457   20001214    
@0RM        1   59451   20001023    
@0TL        2   73862   20001212    
@0TL        3   19824   20000929    
@0TL        1   70970   20001211    
@0TL        3   48061   20000627          
@0TL        1   48061   20001227    
@1AJ        1   58875   20001214    
@1AJ        3   56014   20001214    
@1AJ        3   47340   20001214    
@1AJ        3   19813   20001214    
@1AL        1   44416   20000517    
@1AL        4   59184   20000801    
@1AL        3   59184   20000413    
@1AL        4   72832   20001127    
@1AL        1   52718   20000621    
@1AL        2   59184   20000707    
@1AL        3   73568   20001130    
@1AL        3   72832   20001211    
@1AL        3   44416   20000303    

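For each unique Name, I want to compare the IDs. Where the IDs match, I look at the dates and compare the rating at the later date with the rating at the earlier date. If the ratings are the same I ignore the pair, but if they differ I want to count it as a transition.

In the first two rows, for Name @0CC, the IDs match and the ratings are the same, so I don't count anything. For @1AL, however, ID 59184 occurs three times, on the dates 20000413, 20000707 and 20000801, with ratings 3, 2 and 4 respectively. Since the rating changes from 3 to 2 and then from 2 to 4, I want to record this in a transition matrix of the following format.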

 From   1 2 3 4 5 (to)
  1
  2           1 
  3       1
  4
  5

I'm new to this kind of data manipulation. This is all I have so far:

for(i in unique(dataset$Name)
if dataset[,3]=dataset[,3]

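I don't think the second line is even correct. I'm really stuck and would appreciate any advice I can get.

1 answer:

Answer 0: (score: 1)

It took some time, but I think I found a solution to your problem:

Convert to data.table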

install.packages("data.table") # if not installed already
library(data.table)
# DT: your data.frame,
# e.g. copy the table above and read it from the clipboard:
# DT <- read.table("clipboard", header = TRUE)
DT <- as.data.table(DT) # convert into data.table
# read.table turns the header DATE(YYYYmmdd) into DATE.YYYYmmdd.
setkey(DT, Name, DATE.YYYYmmdd.) # sort by Name, then by date
# this shows some temporary result:
DT[, print(Rating), by = list(Name, ID)]
  # [1] 1 1
  # [1] 2
  # [1] 1
  # [1] 1
  # [1] 1
  # [1] 3 1
  # [1] 3
  # [1] 1
  # [1] 2
  # [1] 1
  # [1] 3
  # [1] 3
  # [1] 3
  # [1] 3 1
  # [1] 3 2 4
  # [1] 1
  # [1] 4 3
  # [1] 3

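One problem is that data.table does not return a vector for each subset (as far as I know). The workaround is therefore to pack the single-digit ratings of a group into one longer number and split it back into digits later.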
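
The packing idea can be checked on its own before it is applied per group. The following is a minimal sketch in base R; `pack_digits` and `unpack_digits` are hypothetical helper names used only for illustration, and the sketch assumes every rating is a single digit from 1 to 9.

```r
# Hypothetical helpers, named here only for illustration.
# Assumption: every rating is a single digit (1 to 9).

# Weight each rating by its decimal position: c(3, 2, 4) -> 3*100 + 2*10 + 4
pack_digits <- function(vec) {
  as.integer(sum(vec * 10^(rev(seq_along(vec)) - 1)))
}

# Split the packed number back into its single-digit ratings
unpack_digits <- function(num) {
  as.integer(strsplit(as.character(num), "")[[1]])
}

packed <- pack_digits(c(3, 2, 4))
packed                 # 324
unpack_digits(packed)  # 3 2 4
```

The round trip only works because each rating occupies exactly one decimal digit, which is also where the approach stops working for ratings of 10 or more.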

Get the ratings

# Pack a vector of single-digit ratings into one integer, e.g. c(3, 2, 4) -> 324.
setVal <- function(vec){
  res <- 0
  for (i in seq_along(vec)){
    res <- res + vec[i] * 10^(length(vec)-i)
  }
  return(as.integer(res))
}
# Store the packed rating sequence of each (Name, ID) group in column R.
DT <- DT[, R:=setVal(Rating), by = list(Name, ID)]
DT # Not yet the desired result: the packed value is repeated on every row of its group, e.g. 324 appears 3 times and 11 appears 2 times.
  #     Name Rating    ID DATE.YYYYmmdd.   R
  #  1: @0CC      1 71476       20000704  11
  #  2: @0CC      1 71476       20001204  11
  #  3: @0RM      2 49960       20000131   2
  #  4: @0RM      1 73565       20000919   1
  #  5: @0RM      1 59451       20001023   1
  #  6: @0RM      1 44457       20001214   1
  #  7: @0TL      3 48061       20000627  31
  #  8: @0TL      3 19824       20000929   3
  #  9: @0TL      1 70970       20001211   1
  # 10: @0TL      2 73862       20001212   2
  # 11: @0TL      1 48061       20001227  31
  # 12: @1AJ      1 58875       20001214   1
  # 13: @1AJ      3 56014       20001214   3
  # 14: @1AJ      3 47340       20001214   3
  # 15: @1AJ      3 19813       20001214   3
  # 16: @1AL      3 44416       20000303  31
  # 17: @1AL      3 59184       20000413 324
  # 18: @1AL      1 44416       20000517  31
  # 19: @1AL      1 52718       20000621   1
  # 20: @1AL      2 59184       20000707 324
  # 21: @1AL      4 59184       20000801 324
  # 22: @1AL      4 72832       20001127  43
  # 23: @1AL      3 73568       20001130   3
  # 24: @1AL      3 72832       20001211  43
#The result has to be filtered by unique pairs of Name and ID.
R <- DT[,unique(R), by = list(Name, ID)]$V1
#[1]  11   2   1   1   1  31   3   1   2   1   3   3   3  31 324   1  43   3

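Convert the result into a transition matrix

There may be a simpler way to split R back into single digits, count the changes and put them into a matrix, but this is what I came up with: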

TransitionMatrix <- function(col, ncol = 5){
  # add the rating changes found in one digit sequence to the count matrix
  intoMat <- function(Mat, vec){
    if(length(vec)>1){
      for (i in 1:(length(vec)-1)){
        if (vec[i] != vec[i+1]){
          Mat[vec[i], vec[i+1]] <- Mat[vec[i], vec[i+1]] + 1
        }
      }
    }
    return(Mat)
  }
  # rows are the "from" rating, columns are the "to" rating
  Mat <- matrix(0, ncol = ncol, nrow = ncol)
  for (j in seq_along(col)){
    L <- nchar(as.character(col[j])) # number of ratings packed into col[j]
    if(L>1){
      # split the packed number back into its single-digit ratings
      values <- as.numeric(unlist(strsplit(as.character(col[j]),"")))
      Mat <- intoMat(Mat, values)
    }
  }
  return(Mat)
}

TransitionMatrix(R, 5)
  #      [,1] [,2] [,3] [,4] [,5]
  # [1,]    0    0    0    0    0
  # [2,]    0    0    0    1    0
  # [3,]    2    1    0    0    0
  # [4,]    0    0    1    0    0
  # [5,]    0    0    0    0    0

A limitation of this solution is that it breaks as soon as a rating is greater than 9, i.e. has two digits.
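
One possible way around that limitation is to skip the digit packing and compare each row with the previous row of the same (Name, ID) group directly. The sketch below is not part of the approach above. It uses base R only, works on a small subset of the question's rows, and uses illustrative names (`df`, `Date`, `prev`, `tm`) that are assumptions made for this example.

```r
# A subset of the question's data; the column name Date is an assumption.
df <- data.frame(
  Name   = c("@0CC", "@0CC", "@1AL", "@1AL", "@1AL", "@1AL", "@1AL"),
  ID     = c(71476, 71476, 59184, 59184, 59184, 72832, 72832),
  Date   = c(20000704, 20001204, 20000801, 20000413, 20000707,
             20001127, 20001211),
  Rating = c(1, 1, 4, 3, 2, 4, 3)
)

# Sort so that rows of the same (Name, ID) group are adjacent and in date order
df <- df[order(df$Name, df$ID, df$Date), ]
n  <- nrow(df)

# TRUE where a row belongs to the same (Name, ID) group as the row before it
same_group <- c(FALSE, df$Name[-1] == df$Name[-n] & df$ID[-1] == df$ID[-n])

# Rating of the previous row (NA for the very first row)
prev <- c(NA, df$Rating[-n])

# Keep only rating changes within a group
changed <- same_group & prev != df$Rating

# Cross-tabulate from/to ratings; fixed levels keep the matrix 5 x 5
lv <- 1:5
tm <- table(From = factor(prev[changed], levels = lv),
            To   = factor(df$Rating[changed], levels = lv))
tm
```

For this subset the table counts three transitions: 3 to 2 and 2 to 4 for ID 59184, and 4 to 3 for ID 72832. Because ratings are never converted to digits, widening `lv` (for example to `1:12`) would be enough to handle ratings above 9.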