Basic statistics question about the correlation coefficient and loops

Time: 2019-11-23 03:42:21

Tags: r

Say I have two vectors representing random variables:

x<-rnorm(100000)
y<-rexp(100000)

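What would the code be to compute the correlation coefficient between the two vectors using a for loop?

I'm very new to R, so simple answers would be better. Thanks.

2 Answers:

Answer 0: (score: 2)

Hi Joe, and welcome!

You don't need a for loop. cor.test gives you the correlation coefficient between the two vectors, together with a significance test and a confidence interval; cor(x, y) returns just the coefficient.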

x<-rnorm(1000) 
y<-rexp(1000)
cor.test(x,y)

You will get output like the following (the exact numbers vary from run to run because no seed is set):

> cor.test(x,y)

    Pearson's product-moment correlation

data:  x and y
t = 1.5191, df = 998, p-value = 0.1291
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01400425  0.10969698
sample estimates:
       cor 
0.04803053 
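
If only the number is needed, it can be pulled out of the test result instead of being read off the printout. A minimal sketch, assuming an arbitrary seed of 42 for reproducibility; the names r_direct, res, r_test and p_value are illustrative and not part of the original answer:

```r
set.seed(42)                      # arbitrary seed, only so the sketch is reproducible
x <- rnorm(1000)
y <- rexp(1000)

r_direct <- cor(x, y)             # Pearson coefficient only
res      <- cor.test(x, y)        # full test result, an object of class "htest"
r_test   <- unname(res$estimate)  # the same coefficient, taken from the test object
p_value  <- res$p.value           # p-value of the test that the true correlation is 0

print(all.equal(r_direct, r_test))
```

The conf.int element of the same object holds the confidence interval shown in the printed output.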

You can also plot it on a scatter plot using ggpubr:

library(ggpubr)
df = data.frame(x,y)
ggscatter(df, x = "x", y = "y", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson")

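This produces the following plot:

[Figure: scatter plot of x against y with the fitted regression line, its confidence band, and the Pearson correlation coefficient annotated]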

Using a for loop for multiple correlations

# Generate a data frame with several vectors to compare:
df = NULL 
for(i in 1:5)
{
  df = data.frame(cbind(df,rnorm(1000)))
}

# Compute the correlation between column 1 and every column (including itself) using a for loop
for(i in 1:ncol(df))
{
  if(i==1){correlation = cor(df[,1],df[,i])}
  else{correlation = c(correlation, cor(df[,1],df[,i]))}
}

> correlation
[1]  1.00000000 -0.05276680 -0.03968104 -0.02960876  0.01861618

# Using apply
correlation = as.vector(apply(df,2,function(x){cor(x,df[,1])}))

> correlation
[1]  1.00000000 -0.05276680 -0.03968104 -0.02960876  0.01861618
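
The loop above grows the result vector one element at a time. As a hedged alternative, the sketch below preallocates the result and then cross-checks it against the first row of the full correlation matrix returned by cor(). The seed and the names by_loop and by_matrix are assumptions made for illustration:

```r
set.seed(1)                                     # arbitrary seed for reproducibility
df <- as.data.frame(replicate(5, rnorm(1000)))  # 5 columns named V1..V5

# Loop version with a preallocated result vector
by_loop <- numeric(ncol(df))
for (i in seq_along(df)) {
  by_loop[i] <- cor(df[[1]], df[[i]])
}

# Same values taken from the first row of the full 5 x 5 correlation matrix
by_matrix <- unname(cor(df)[1, ])

print(all.equal(by_loop, by_matrix))
```

Calling cor() on the whole data frame computes every pairwise correlation, which is more work than needed here but is often the simplest thing to write.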

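Answer 1: (score: 2)

There is no reason why anyone would need the function below. It is just the Pearson correlation formula implemented with R's arithmetic operations.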

cor2 <- function(x, y, na.rm = FALSE){
  stopifnot(length(x) == length(y))
  if(na.rm){
    # drop a pair when either value is missing, so x and y stay aligned
    ok <- !is.na(x) & !is.na(y)
    x <- x[ok]
    y <- y[ok]
  }
  n <- length(x)
  sum.x <- 0
  sum.y <- 0
  sum.x2 <- 0
  sum.y2 <- 0
  sum.xy <- 0
  for(i in seq_along(x)){
    sum.x <- sum.x + x[i]
    sum.y <- sum.y + y[i]
    sum.x2 <- sum.x2 + x[i]^2
    sum.y2 <- sum.y2 + y[i]^2
    sum.xy <- sum.xy + x[i]*y[i]
  }
  numer <- n*sum.xy - sum.x*sum.y
  denom <- sqrt(n*sum.x2 - sum.x^2)*sqrt(n*sum.y2 - sum.y^2)
  numer/denom
}
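
For reference, the five running sums in the loop correspond to the single-pass form of the Pearson coefficient. The notation below is supplied here for illustration, with n the common length of x and y:

```latex
\[
r_{xy} =
\frac{n \sum_{i=1}^{n} x_i y_i \;-\; \sum_{i=1}^{n} x_i \, \sum_{i=1}^{n} y_i}
     {\sqrt{n \sum_{i=1}^{n} x_i^2 - \Bigl(\sum_{i=1}^{n} x_i\Bigr)^{2}}\;
      \sqrt{n \sum_{i=1}^{n} y_i^2 - \Bigl(\sum_{i=1}^{n} y_i\Bigr)^{2}}}
\]
```

The numerator matches numer in the function and the two square roots match denom.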

set.seed(1234)

x <- rnorm(20)
y <- rexp(20)

cor(x, y)
#[1] -0.07445358
cor2(x, y)
#[1] -0.07445358

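The results are not identical(), because the two computations round differently in floating point; they agree within all.equal()'s tolerance: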

identical(cor(x, y),cor2(x, y))
#[1] FALSE
all.equal(cor(x, y),cor2(x, y))
#[1] TRUE
cor(x, y) - cor2(x, y)
#[1] -4.163336e-17
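
The same distinction shows up in a much smaller case. A minimal sketch using the familiar 0.1 + 0.2 example (the names a and b are illustrative): identical() demands bit-for-bit equality, while all.equal() allows a small numerical tolerance and should be wrapped in isTRUE() when used in a condition.

```r
a <- 0.1 + 0.2
b <- 0.3

print(identical(a, b))          # exact comparison of the two doubles
print(isTRUE(all.equal(a, b)))  # comparison within a small tolerance
print(a - b)                    # the tiny rounding difference itself
```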

But the built-in function is much faster.

x <- rnorm(1000)
y <- rexp(1000)

microbenchmark::microbenchmark(
  base = cor(x, y),
  rui = cor2(x, y)
)
#Unit: microseconds
# expr     min      lq      mean  median       uq     max neval cld
# base  40.936  42.797  45.54924  43.482  44.8155 101.172   100  a 
#  rui 479.102 481.738 496.01586 483.190 491.5015 690.619   100   b
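
Much of the gap comes from running the loop in interpreted R code. As a hedged sketch, the same formula can be written with vectorized sum() calls; cor_vec is a hypothetical name and was not part of the benchmark above, so no timing is claimed for it here:

```r
# Same single-pass formula as cor2, but with vectorized sums instead of a loop
cor_vec <- function(x, y) {
  stopifnot(length(x) == length(y))
  n <- length(x)
  numer <- n * sum(x * y) - sum(x) * sum(y)
  denom <- sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2)
  numer / denom
}

set.seed(1234)  # arbitrary seed for reproducibility
x <- rnorm(1000)
y <- rexp(1000)

print(all.equal(cor(x, y), cor_vec(x, y)))
```

It could be added as a third expression in the microbenchmark call to see where it falls between the two versions.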