Covariance matrices by group

Time: 2014-02-09 04:24:16

Tags: r matrix covariance

I have been able to calculate the covariance for my large dataset with:

  

cov(MyMatrix, use = "pairwise.complete.obs", method = "pearson")

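This gives me the covariance table I am looking for and handles the NAs scattered throughout the data. For a deeper analysis, however, I would like to create separate covariance matrices for each of the 800+ groups in my dataset (some have 40+ observations, others only 1). I tried the following (from http://www.mail-archive.com/r-help@r-project.org/msg86328.html):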

  

lapply(list(cov), by, data = MyMatrix[8:13], INDICES = MyMatrix["Group"])

which gave me the following error:

  

Error in tapply(seq_len(6L), list(MyMatrix["Group"] = NA_real_), function(x) :     arguments must have same length

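This made me think the problem had to do with the missing (NA) data, so I tried to work the use = "pairwise.complete.obs", method = "pearson" arguments into the lapply code, but could not get it to work. I was not sure where they belong, so I tried sticking them in various places: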

  

lapply(list(cov), use = "pairwise.complete.obs", method = "pearson"), by, data = MyMatrix[8:13], INDICES = MyMatrix["Group"])

     

lapply(list(cov), by, data = PhenoMtrix[8:13], INDICES = PhenoMtrix["Group"], use = "pairwise.complete.obs", method = "pearson")

This is obviously sloppy and does not work, so I am a bit stuck. Thanks in advance for your help!

My data is formatted as follows:

  

Group     HML      RML      FML      TML    FHD    BIB
    1  323.50   248.75   434.50   355.75  46.84     NA
    2      NA   238.50   441.50   353.00  45.83  277.0
    2  309.50   227.75   419.00   332.25  46.39  284.0

3 Answers:

Answer 0 (score: 1)

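It would be better if you had provided a sample of your data (or all of it), but since you didn't, here is an example with simulated data: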

# create sample data
set.seed(1)
MyMatrix <- data.frame(group=rep(1:5, each=100),matrix(rnorm(2500),ncol=5))
# generate list of covariance matrices by group
cov.list <- lapply(unique(MyMatrix$group),
                   function(x)cov(MyMatrix[MyMatrix$group==x,-1],
                                  use="na.or.complete"))
cov.list[1]
# [[1]]
#             X1          X2          X3          X4          X5
# X1  0.80676209 -0.09541458 -0.12704666 -0.04122976  0.08636307
# X2 -0.09541458  0.93350463 -0.05197573 -0.06457299 -0.02203141
# X3 -0.12704666 -0.05197573  1.06030090  0.07324986  0.01840894
# X4 -0.04122976 -0.06457299  0.07324986  1.12059428  0.02385031
# X5  0.08636307 -0.02203141  0.01840894  0.02385031  1.11101410

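In this example we create a data frame called MyMatrix with six columns. The first is group, and the other five are X1, X2, ... X5, which hold the data we want to correlate. Hopefully this is similar to the structure of your dataset.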

The operative line of code is:

cov.list <- lapply(unique(MyMatrix$group),
                   function(x)cov(MyMatrix[MyMatrix$group==x,-1],
                                  use="na.or.complete"))

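This takes the list of group IDs (from unique(MyMatrix$group)) and calls the function with each of them. The function computes the covariance matrix of all columns except the first (the -1 index drops the group column), using all rows in the relevant group, and the results are stored in a 5-element list (there are 5 groups in this example).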

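Note on how NAs are handled: there are actually several options, and you should look at the documentation in ?cov to see what they are. The method chosen here, use="na.or.complete", includes in the calculation only rows that have no NA values in any column. If, for a given group, there are no such rows, cov(...) returns NA. There are several other choices, including the use="pairwise.complete.obs" from your question.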
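To make the difference between the NA-handling options concrete, here is a small sketch. The data frames d and d_none and their values are made up purely for illustration; only cov() and its documented use values are taken from R itself.

```r
# Hypothetical toy data: NAs are placed so that only rows 3 and 4 are complete.
d <- data.frame(a = c(1, 2, 3, 4),
                b = c(2, NA, 6, 9),
                c = c(NA, 1, 5, 2))

# Listwise deletion: every entry is computed from the complete rows (3 and 4) only.
cov_complete <- cov(d, use = "na.or.complete")

# Pairwise deletion: each pair of columns uses every row in which both are present,
# so different entries of the matrix can be based on different rows.
cov_pairwise <- cov(d, use = "pairwise.complete.obs")

# A data frame with no complete rows: "na.or.complete" gives NA rather than an error.
d_none <- data.frame(a = c(1, NA), b = c(NA, 2))
cov_none <- cov(d_none, use = "na.or.complete")

print(cov_complete)
print(cov_pairwise)
print(cov_none)
```

Pairwise deletion keeps more of the data, but the resulting matrix is not guaranteed to be positive semi-definite, which may matter if the matrices feed into further analysis.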

Answer 1 (score: 0)

You could also try:

  by(MyMatrix[-1],MyMatrix$group,cov,use="na.or.complete")
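Since by() forwards any extra arguments to the function it applies, the use and method arguments from the question can be passed the same way. The following is a sketch on simulated data; the column names X1..X3, the seed, and the injected NA are assumptions for illustration, not the asker's real data.

```r
# Simulated stand-in for the asker's data: 3 groups of 5 rows, 3 numeric columns.
set.seed(42)
MyMatrix <- data.frame(Group = rep(1:3, each = 5),
                       matrix(rnorm(45), ncol = 3))  # columns are named X1, X2, X3
MyMatrix[2, "X1"] <- NA  # inject a missing value into group 1

# Arguments after the function (cov) are passed through to it for every group.
cov_by_group <- by(MyMatrix[c("X1", "X2", "X3")], MyMatrix$Group, cov,
                   use = "pairwise.complete.obs", method = "pearson")

# The result can be indexed by group label like a list.
group1 <- cov_by_group[["1"]]
print(group1)
```

On the real data, the column selection would presumably be MyMatrix[8:13] and the grouping vector MyMatrix$Group, as in the question.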

Answer 2 (score: 0)

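You could also first convert the data frame to a list, then use lapply to run the cov function over that list structure, which returns a list of covariance matrices, one per group.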

Your sample data frame is too small to answer your question, so I used sample data similar to @jlhoward's, with some of your column names:

#Create sample dataframe and rename the columns based on the initial question 
MyDataframe <- data.frame(group=rep(1:5, each=10),matrix(rnorm(250),ncol=5))
colnames(MyDataframe) <- c("Group", "HML", "RML", "FML", "TML", "FHD")

#Split the dataframe columns HML, RML, FML, TML, and FHD into lists based on group membership, and call the new list MyList
MyList <- split(MyDataframe[ ,c("HML", "RML", "FML", "TML", "FHD")], list(Group = MyDataframe$Group))
> MyList
$`1`
          HML        RML        FML         TML         FHD
1   1.6806547 -1.2357861 -0.1438550 -1.79852015 -0.18745361
2  -1.2750024 -1.0973354  0.8654817  0.51666643  1.23240278
3   0.2381941 -1.1605690 -1.1124618 -1.24223216 -0.47014275
4  -0.6592671  0.6749256  0.3744053  2.82355336  0.04349764
5  -1.2026018  2.2036865 -0.6543408  0.05235647 -0.88794230
6   0.7946254  1.9786356 -0.1276282  0.37147386  2.23512260
7  -0.3166249 -1.0072974 -2.0800837 -0.31275558  0.88379182
8  -0.1662388  1.3819116 -3.1629656 -0.86033274 -0.31272981
9  -0.4666707 -0.8104205  1.0934703 -0.02459932 -0.35725108
10 -0.8385697  1.7204379 -0.3447757  0.18629448  0.42084553

$`2`
           HML         RML        FML           TML        FHD
11  0.03604007  0.58921306 -0.2066693 -1.0887154121  0.2790660
12  0.81767599  0.08703872 -0.1476078  0.1261011136  0.3525258
13 -0.19341506  0.31941568 -1.2553003  0.2419955263 -0.3152117
14  0.36065670 -0.77353050 -0.2166640  0.0001615059  0.7663386
15 -1.62885990  0.18124576  0.8299511 -1.0140332552 -0.9668448
16 -0.44847189 -1.57839214 -0.6470409 -0.0612936448 -0.3844145
17 -0.11144444 -0.65229817  0.6505128 -0.0882344334 -0.3144284
18  0.74339324  1.78857053 -1.2333200 -0.9063703037 -0.0765000
19  1.51958249 -0.56289571  0.2964601 -0.0287684624  0.3151081
20 -1.56974385  0.28559655 -2.7583618  0.3632164248 -0.1410783

$`3`
          HML         RML        FML         TML         FHD
21  1.8902244 -1.32617251 -0.3473238 -0.14714488 -0.20950269
22  0.7233421  1.87021160  0.5498787 -0.21878322 -0.25967403
23 -0.4488791  0.40916110 -0.2716354  0.68897421 -0.87347369
24 -0.4013050  0.41924705  0.6404477  0.81811788  1.24055660
25  1.1542181  0.75534163  0.1067173  0.32427043 -0.85858957
26 -1.3252742 -0.09989574  1.4557291 -0.62678378  0.04029924
27  0.2694684 -0.16238724 -0.6138011  0.07998383 -0.78157860
28 -0.8149025  0.77406215 -0.6921972  0.21223283 -0.86679556
29 -0.4916411 -0.80898776 -0.9372076 -1.44085453  1.18841866
30  0.3670508  1.45821533 -1.2531432  0.23593131 -1.17231457

$`4`
           HML        RML        FML         TML         FHD
31 -0.91753704  1.5976080  1.9286179 -0.88697107  0.85215534
32  0.57087719  1.2202687  0.5791964  1.98994106  0.68640384
33  0.79562327  1.0253044  0.5356456  0.31906648 -1.06342199
34 -0.06380725 -0.5774832  0.7260138 -0.93905123  1.88579741
35  0.24285367 -1.3862499  0.2853635 -1.27603774  0.07991027
36  1.15532419  0.4545112  0.3121971 -0.80544639 -0.74762482
37  1.30120698  1.3480126 -0.1012468  0.03093374 -0.74170584
38 -0.04423831  0.9100061 -2.1983937 -0.88974443  0.50814835
39 -1.71264891 -0.1225082  0.5095046 -1.28680921 -0.37710894
40 -0.11079800 -0.6806858 -0.9002725 -0.70797874  0.49889563

$`5`
          HML         RML        FML        TML         FHD
41 -0.6549724  0.77703431 -0.7953904 -0.7044253  0.73765368
42 -2.3945883  1.16952896  1.8286481 -0.8116904 -0.59562563
43  1.3470786 -0.26396886  0.3858448 -0.1839417  0.66618305
44 -0.4450848  0.71092152 -0.7665068 -0.1213066 -1.33159041
45  0.2621206 -0.05290252 -0.2817160 -1.1119020  0.53377605
46 -1.8713943 -0.82580895  0.5590292  0.5474239  1.85929122
47  2.1826177 -1.88918691 -0.2495949 -0.7371631  1.33998290
48 -0.2294448  1.04252185 -1.3311849  0.0447891  0.48173560
49  0.2250941 -0.37240902  1.1648265 -0.4848731 -0.06271555
50 -0.4131518 -0.94258989 -0.3291930 -1.7198636 -0.80485465

#Compute the covariance matrix for each group in MyList, and here I specify the columns, which is always good practice
within.group.cov <- lapply(MyList, function(x) cov(x[c("HML", "RML", "FML", "TML", "FHD")], use="pairwise.complete.obs"))
> within.group.cov
$`1`
           HML        RML        FML        TML       FHD
HML  0.8421194 -0.3185286 -0.1532037 -0.6192220 0.1086547
RML -0.3185286  2.1271906 -0.3640053  0.5995860 0.1025134
FML -0.1532037 -0.3640053  1.7151420  0.6879402 0.2133205
TML -0.6192220  0.5995860  0.6879402  1.5581052 0.2910293
FHD  0.1086547  0.1025134  0.2133205  0.2910293 0.8965167

$`2`
           HML         RML         FML         TML         FHD
HML 1.00082711  0.02754603  0.28260061  0.03354966  0.33759313
RML 0.02754603  0.84357799 -0.32547765 -0.24018033 -0.02572972
FML 0.28260061 -0.32547765  1.13751383 -0.22229766 -0.03210768
TML 0.03354966 -0.24018033 -0.22229766  0.29451851  0.06511133
FHD 0.33759313 -0.02572972 -0.03210768  0.06511133  0.24129973

$`3`
            HML         RML         FML         TML        FHD
HML  0.95281484 -0.06072793 -0.18608614  0.08682029 -0.2241560
RML -0.06072793  0.94498926  0.05826306  0.26708137 -0.3414530
FML -0.18608614  0.05826306  0.68405108  0.02649167  0.2239994
TML  0.08682029  0.26708137  0.02649167  0.43268293 -0.2285798
FHD -0.22415605 -0.34145303  0.22399939 -0.22857978  0.7388127

$`4`
           HML        RML         FML         TML         FHD
HML  0.8545644  0.2010881 -0.18227287  0.43631519 -0.31002651
RML  0.2010881  1.0269461  0.16025720  0.53798506 -0.20676662
FML -0.1822729  0.1602572  1.18667903  0.11072818  0.07572155
TML  0.4363152  0.5379851  0.11072818  0.99720184 -0.07098033
FHD -0.3100265 -0.2067666  0.07572155 -0.07098033  0.82212198

$`5`
           HML         RML         FML         TML         FHD
HML  1.8208574 -0.73730457 -0.43570215 -0.13051837  0.30984815
RML -0.7373046  0.98584765 -0.06674645  0.10796767 -0.43046406
FML -0.4357021 -0.06674645  0.93344577 -0.00659714 -0.03835305
TML -0.1305184  0.10796767 -0.00659714  0.40967714  0.26304517
FHD  0.3098481 -0.43046406 -0.03835305  0.26304517  0.97107321
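The question mentions that some groups have only one observation. A covariance cannot be estimated from a single row, so cov() yields only NA for such a group. One possible way to handle this, sketched below on made-up data (the data frame df and the threshold of 2 rows are assumptions for illustration), is to drop groups that are too small before calling cov:

```r
# Hypothetical data: group "a" has 4 rows, group "b" has only 1.
set.seed(1)
df <- data.frame(Group = c("a", "a", "a", "a", "b"),
                 HML = rnorm(5),
                 RML = rnorm(5))

# Split the numeric columns by group, as in the answer above.
groups <- split(df[c("HML", "RML")], df$Group)

# A single-row group produces a matrix of NA.
single_cov <- cov(groups$b, use = "pairwise.complete.obs")

# Keep only groups with at least 2 rows, then compute the covariances.
big_enough <- Filter(function(g) nrow(g) >= 2, groups)
covs <- lapply(big_enough, cov, use = "pairwise.complete.obs")

print(names(covs))
print(covs$a)
```

Note that the row count is only a rough filter: a group with two or more rows can still give NA entries if a pair of columns has fewer than two rows where both values are present.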