Finding correlated pairs with R

Time: 2013-02-05 08:06:17

Tags: r correlation

           VZ.Close CBOU.Close SBUX.Close   T.Close
VZ.Close   1.0000000  0.5804478  0.8872978 0.9480894
CBOU.Close 0.5804478  1.0000000  0.7876277 0.4988890
SBUX.Close 0.8872978  0.7876277  1.0000000 0.8143305
T.Close    0.9480894  0.4988890  0.8143305 1.0000000

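Suppose I have these correlations between stock prices. I want to look at the first row and find the pair with the highest correlation. That would be VZ and T. Then I want to remove those two stocks from consideration and, among the remaining stocks, find the pair with the highest correlation, and so on until every stock is paired. In this example the second pair is obviously CBOU and SBUX, because they are the only two left, but I would like the code to handle any number of pairs.

2 Answers:

Answer 0 (score: 4)

Here is a solution if you want the maximum correlation at each step. The first step therefore looks at the whole matrix, not just the first row.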

Sample data:

d <- matrix(runif(36),ncol=6,nrow=6)
rownames(d) <- colnames(d) <- LETTERS[1:6]
diag(d) <- 1
d
           A          B         C          D         E          F
A 1.00000000 0.65209204 0.8520392 0.26980214 0.5844000 0.69335143
B 0.73531603 1.00000000 0.5499431 0.60511580 0.7483990 0.14788134
C 0.56433218 0.27242769 1.0000000 0.07952776 0.2147628 0.03711562
D 0.91756919 0.04853523 0.5554490 1.00000000 0.4344089 0.23381447
E 0.06897889 0.80740821 0.7974340 0.87425643 1.0000000 0.74546072
F 0.19961474 0.61665231 0.2829632 0.58110694 0.7433924 1.00000000

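Code:

# Empty data frame that collects one row per pair
results <- data.frame(v1=character(0), v2=character(0), cor=numeric(0), stringsAsFactors=FALSE)
diag(d) <- 0  # ignore self-correlations
# Used entries are marked by setting them to 0, so this assumes all correlations are positive
while (sum(d>0)>1) {
  maxval <- max(d)
  # Row/column index of the largest remaining value (named idx to avoid masking base::max)
  idx <- which(d==maxval, arr.ind=TRUE)[1,]
  results <- rbind(results, data.frame(v1=rownames(d)[idx[1]], v2=colnames(d)[idx[2]], cor=maxval, stringsAsFactors=FALSE))
  # Remove both members of the pair from further consideration
  d[idx[1],] <- 0
  d[,idx[1]] <- 0
  d[idx[2],] <- 0
  d[,idx[2]] <- 0
}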

Which gives:

  v1 v2       cor
1  D  A 0.9175692
2  E  B 0.8074082
3  F  C 0.2829632
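
The same whole-matrix idea can be tried on the matrix from the question. The following is only a minimal sketch: `greedy_pairs` and `cm` are hypothetical names that do not appear in the original posts, and the helper assumes a symmetric matrix with unique row/column names and no missing values. It marks used entries with `NA` rather than 0, so negative correlations would not confuse it.

```r
# Correlation matrix from the question
tickers <- c("VZ.Close", "CBOU.Close", "SBUX.Close", "T.Close")
cm <- matrix(c(1.0000000, 0.5804478, 0.8872978, 0.9480894,
               0.5804478, 1.0000000, 0.7876277, 0.4988890,
               0.8872978, 0.7876277, 1.0000000, 0.8143305,
               0.9480894, 0.4988890, 0.8143305, 1.0000000),
             nrow = 4, byrow = TRUE, dimnames = list(tickers, tickers))

# Hypothetical helper: repeatedly take the largest remaining entry of the whole matrix
greedy_pairs <- function(m) {
  m[!upper.tri(m)] <- NA            # keep each unordered pair once, drop the diagonal
  n_pairs <- nrow(m) %/% 2          # with an odd count, one item stays unpaired
  out <- data.frame(v1 = character(n_pairs), v2 = character(n_pairs),
                    cor = numeric(n_pairs), stringsAsFactors = FALSE)
  for (k in seq_len(n_pairs)) {
    # Position of the largest value still available
    idx <- which(m == max(m, na.rm = TRUE), arr.ind = TRUE)[1, ]
    out$v1[k]  <- rownames(m)[idx[1]]
    out$v2[k]  <- colnames(m)[idx[2]]
    out$cor[k] <- m[idx[1], idx[2]]
    # Both members of the pair are no longer available
    m[idx, ] <- NA
    m[, idx] <- NA
  }
  out
}

pairs_found <- greedy_pairs(cm)
print(pairs_found)
```

On the question's data this pairs VZ.Close with T.Close first and then CBOU.Close with SBUX.Close, which matches what the question expects.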

Answer 1 (score: 0)

I think this answers your question, but I can't be sure, because the original question is a little ambiguous...

# Construct toy example of symmetrical matrix
# nc is number of rows/columns in matrix, in the problem above it was 4, but let's try with 6
nc <- 6
mat <- diag( 1 , nc )
# Create toy correlation data for matrix
dat <- runif( ( (nc^2-nc)/2 ) )
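# Fill both triangles of the matrix with the same values.
# Note: both triangles are filled in column-major order, so the result is NOT exactly
# symmetric (see the output below); mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)] would mirror it.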
mat[lower.tri( mat ) ] <- dat 
mat[upper.tri( mat ) ] <- dat

# Create vector of random string names for row/column names
names <- replicate( nc , expr = paste( sample( c( letters , LETTERS ) , 3 , replace = TRUE ) , collapse = "" ) )
dimnames(mat) <- list( names , names )

# Sanity check
mat
    SXK   llq   xFL   RVW   oYQ   Seb
SXK 1.000 0.973 0.499 0.585 0.813 0.751
llq 0.973 1.000 0.075 0.533 0.794 0.826
xFL 0.499 0.099 1.000 0.099 0.481 0.968
RVW 0.075 0.813 0.620 1.000 0.620 0.307
oYQ 0.585 0.794 0.751 0.968 1.000 0.682
Seb 0.533 0.481 0.826 0.307 0.682 1.000

# Ok - to problem at hand , you can just substitute your matrix into these lines:
# Clearly the diagonal in a correlation matrix will be 1 so this is excluded as per your problem
diag( mat ) <- NA
# Now find the next highest correlation in each row and set this to NA
mat <- t( apply( mat , 1 , function(x) { x[ which.max(x) ] <- NA ; return(x) } ) ) 

# Another sanity check...!
mat

      SXK   llq   xFL   RVW   oYQ   Seb
SXK    NA    NA 0.499 0.585 0.813 0.751
llq    NA    NA 0.075 0.533 0.794 0.826
xFL 0.499 0.099    NA 0.099 0.481    NA
RVW 0.075    NA 0.620    NA 0.620 0.307
oYQ 0.585 0.794 0.751    NA    NA 0.682
Seb 0.533 0.481    NA 0.307 0.682    NA


# Now return the two remaining columns with greatest correlation in that row
res <- t( apply( mat , 1 , function(x) { y <- names( sort(x , TRUE ) )[1:2] ; return( y ) } ) )

res


    [,1]  [,2] 
SXK "oYQ" "Seb"
llq "Seb" "oYQ"
xFL "SXK" "oYQ"
RVW "xFL" "oYQ"
oYQ "llq" "xFL"
Seb "oYQ" "SXK"
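
Note that the per-row result above lists the top candidates for every row, so the same name can appear in several rows. If the goal is the literal row-by-row procedure from the question (take the first unpaired stock, pair it with its most correlated unpaired partner, drop both, repeat), it could be sketched roughly as follows. `pair_by_row` and `cm` are hypothetical names, and the sketch assumes unique row/column names and no missing off-diagonal values.

```r
# Correlation matrix from the question
tickers <- c("VZ.Close", "CBOU.Close", "SBUX.Close", "T.Close")
cm <- matrix(c(1.0000000, 0.5804478, 0.8872978, 0.9480894,
               0.5804478, 1.0000000, 0.7876277, 0.4988890,
               0.8872978, 0.7876277, 1.0000000, 0.8143305,
               0.9480894, 0.4988890, 0.8143305, 1.0000000),
             nrow = 4, byrow = TRUE, dimnames = list(tickers, tickers))

# Hypothetical helper: walk down the rows, pairing each unpaired item
# with its best remaining partner
pair_by_row <- function(m) {
  unpaired <- rownames(m)
  pairs <- list()
  while (length(unpaired) >= 2) {
    first <- unpaired[1]
    candidates <- setdiff(unpaired, first)
    # Most correlated partner among items that are still unpaired
    partner <- candidates[which.max(m[first, candidates])]
    pairs[[length(pairs) + 1]] <- data.frame(v1 = first, v2 = partner,
                                             cor = m[first, partner],
                                             stringsAsFactors = FALSE)
    unpaired <- setdiff(candidates, partner)
  }
  do.call(rbind, pairs)   # NULL if fewer than two items were supplied
}

row_pairs <- pair_by_row(cm)
print(row_pairs)
```

Unlike a search over the whole matrix, this variant depends on the order of the rows: the first row always gets its best partner, even if a stronger pair exists elsewhere in the matrix.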

Does this answer your question?