我已经能够计算我的大数据集的协方差:
cov(MyMatrix,use =“pairwise.complete.obs”,method =“pearson”)
这提供了我正在寻找的协方差表,以及处理整个数据中的NA问题。然而,为了进行更深入的分析,我想创建协方差矩阵,分别处理我的数据集中的800多个组(一些有40多个观察,另一些只有1个)。我试过(来自http://www.mail-archive.com/r-help@r-project.org/msg86328.html):
lapply(list(cov),by,data = MyMatrix [8:13],INDICES = MyMatrix [“Group”])
这给了我以下错误:
tapply错误(seq_len(6L),list(
MyMatrix["Group"]
= NA_real_),function(x): 参数必须具有相同的长度
这让我觉得代码的问题涉及到缺失的NA数据,所以我尝试将“use =”pairwise.complete.obs“,method =”pearson“”短语合并到lapply代码中并且无法获取它工作。我不确定最适合它的地方,所以我试着把它粘在各处:
lapply(list(cov),use =“pairwise.complete.obs”,method =“pearson”),by,data = MyMatrix [8:13],INDICES = MyMatrix [“Group”])
lapply(list(cov),by,data = PhenoMtrix [8:13],INDICES = PhenoMtrix [“Group”],use =“pairwise.complete.obs”,method =“pearson”)
这显然是草率的,不起作用,所以我有点卡住了。在此先感谢您的帮助!
我的数据格式如下:
分组HML RML FML TML FHD BIB
1 323.50 248.75 434.50 355.75 46.84 NA 2 NA 238.50 441.50 353.00 45.83 277.0 2 309.50 227.75 419.00 332.25 46.39 284.0
答案 0 :(得分:1)
如果您提供了数据(或所有数据)的示例,那么 会更好,但由于您没有,
# create sample data
set.seed(1)
MyMatrix <- data.frame(group=rep(1:5, each=100),matrix(rnorm(2500),ncol=5))
# generate list of covariance matrices by group
cov.list <- lapply(unique(MyMatrix$group),
function(x)cov(MyMatrix[MyMatrix$group==x,-1],
use="na.or.complete"))
cov.list[1]
# [[1]]
# X1 X2 X3 X4 X5
# X1 0.80676209 -0.09541458 -0.12704666 -0.04122976 0.08636307
# X2 -0.09541458 0.93350463 -0.05197573 -0.06457299 -0.02203141
# X3 -0.12704666 -0.05197573 1.06030090 0.07324986 0.01840894
# X4 -0.04122976 -0.06457299 0.07324986 1.12059428 0.02385031
# X5 0.08636307 -0.02203141 0.01840894 0.02385031 1.11101410
在此示例中,我们创建了一个名为MyMatrix
的数据框,其中包含六列。第一个是group
,其他五个是X1, X2, ... X5
,包含我们希望关联的数据。希望这与数据集的结构类似。
代码的操作行是:
cov.list <- lapply(unique(MyMatrix$group),
function(x)cov(MyMatrix[MyMatrix$group==x,-1],
use="na.or.complete"))
这将获取组ID的列表(来自unique(MyMatrix$group)
)并使用它们中的每一个调用该函数。该函数计算除{1}之外的所有列的协方差矩阵,对于相关组中的所有行,并将结果存储在5个元素列表中(在此示例中有5个组)。
注意:关于如何处理NA。实际上有几种选择;你应该查看?cov上的文档,看看它们是什么。此处选择的方法MyMatrix
仅在计算中包含任何列中无NA值的行。如果对于给定的组,没有这样的行,use="na.or.complete"
将返回NA。还有其他几种选择。
答案 1 :(得分:0)
您也可以尝试:
by(MyMatrix[-1],MyMatrix$group,cov,use="na.or.complete")
答案 2 :(得分:0)
您还可以先将数据框转换为列表,然后使用lapply在该列表结构上运行cov函数,并为每个组返回协方差矩阵的列表。
您的示例数据框太小,无法回答您的问题,因此我使用了@jlhoward和您的某些列名类似的示例数据:
#Create sample dataframe and rename the columns based on the initial question
MyDataframe <- data.frame(group=rep(1:5, each=10),matrix(rnorm(250),ncol=5))
colnames(MyDataframe) <- c("Group", "HML", "RML", "FML", "TML", "FHD")
#Split the dataframe columns HML, RML, FML, TML, and FHD into lists based on group membership, and call the new list MyList
MyList <- split(MyDataframe[ ,c("HML", "RML", "FML", "TML", "FHD")], list(Group = MyDataframe$Group))
> MyList
$`1`
HML RML FML TML FHD
1 1.6806547 -1.2357861 -0.1438550 -1.79852015 -0.18745361
2 -1.2750024 -1.0973354 0.8654817 0.51666643 1.23240278
3 0.2381941 -1.1605690 -1.1124618 -1.24223216 -0.47014275
4 -0.6592671 0.6749256 0.3744053 2.82355336 0.04349764
5 -1.2026018 2.2036865 -0.6543408 0.05235647 -0.88794230
6 0.7946254 1.9786356 -0.1276282 0.37147386 2.23512260
7 -0.3166249 -1.0072974 -2.0800837 -0.31275558 0.88379182
8 -0.1662388 1.3819116 -3.1629656 -0.86033274 -0.31272981
9 -0.4666707 -0.8104205 1.0934703 -0.02459932 -0.35725108
10 -0.8385697 1.7204379 -0.3447757 0.18629448 0.42084553
$`2`
HML RML FML TML FHD
11 0.03604007 0.58921306 -0.2066693 -1.0887154121 0.2790660
12 0.81767599 0.08703872 -0.1476078 0.1261011136 0.3525258
13 -0.19341506 0.31941568 -1.2553003 0.2419955263 -0.3152117
14 0.36065670 -0.77353050 -0.2166640 0.0001615059 0.7663386
15 -1.62885990 0.18124576 0.8299511 -1.0140332552 -0.9668448
16 -0.44847189 -1.57839214 -0.6470409 -0.0612936448 -0.3844145
17 -0.11144444 -0.65229817 0.6505128 -0.0882344334 -0.3144284
18 0.74339324 1.78857053 -1.2333200 -0.9063703037 -0.0765000
19 1.51958249 -0.56289571 0.2964601 -0.0287684624 0.3151081
20 -1.56974385 0.28559655 -2.7583618 0.3632164248 -0.1410783
$`3`
HML RML FML TML FHD
21 1.8902244 -1.32617251 -0.3473238 -0.14714488 -0.20950269
22 0.7233421 1.87021160 0.5498787 -0.21878322 -0.25967403
23 -0.4488791 0.40916110 -0.2716354 0.68897421 -0.87347369
24 -0.4013050 0.41924705 0.6404477 0.81811788 1.24055660
25 1.1542181 0.75534163 0.1067173 0.32427043 -0.85858957
26 -1.3252742 -0.09989574 1.4557291 -0.62678378 0.04029924
27 0.2694684 -0.16238724 -0.6138011 0.07998383 -0.78157860
28 -0.8149025 0.77406215 -0.6921972 0.21223283 -0.86679556
29 -0.4916411 -0.80898776 -0.9372076 -1.44085453 1.18841866
30 0.3670508 1.45821533 -1.2531432 0.23593131 -1.17231457
$`4`
HML RML FML TML FHD
31 -0.91753704 1.5976080 1.9286179 -0.88697107 0.85215534
32 0.57087719 1.2202687 0.5791964 1.98994106 0.68640384
33 0.79562327 1.0253044 0.5356456 0.31906648 -1.06342199
34 -0.06380725 -0.5774832 0.7260138 -0.93905123 1.88579741
35 0.24285367 -1.3862499 0.2853635 -1.27603774 0.07991027
36 1.15532419 0.4545112 0.3121971 -0.80544639 -0.74762482
37 1.30120698 1.3480126 -0.1012468 0.03093374 -0.74170584
38 -0.04423831 0.9100061 -2.1983937 -0.88974443 0.50814835
39 -1.71264891 -0.1225082 0.5095046 -1.28680921 -0.37710894
40 -0.11079800 -0.6806858 -0.9002725 -0.70797874 0.49889563
$`5`
HML RML FML TML FHD
41 -0.6549724 0.77703431 -0.7953904 -0.7044253 0.73765368
42 -2.3945883 1.16952896 1.8286481 -0.8116904 -0.59562563
43 1.3470786 -0.26396886 0.3858448 -0.1839417 0.66618305
44 -0.4450848 0.71092152 -0.7665068 -0.1213066 -1.33159041
45 0.2621206 -0.05290252 -0.2817160 -1.1119020 0.53377605
46 -1.8713943 -0.82580895 0.5590292 0.5474239 1.85929122
47 2.1826177 -1.88918691 -0.2495949 -0.7371631 1.33998290
48 -0.2294448 1.04252185 -1.3311849 0.0447891 0.48173560
49 0.2250941 -0.37240902 1.1648265 -0.4848731 -0.06271555
50 -0.4131518 -0.94258989 -0.3291930 -1.7198636 -0.80485465
#Compute the covariance matrix for each group in MyList, and here I specify the columns, which is always good practice
within.group.cov <- lapply(MyList, function(x) cov(x[c("HML", "RML", "FML", "TML", "FHD")], use="pairwise.complete.obs"))
> within.group.cov
$`1`
HML RML FML TML FHD
HML 0.8421194 -0.3185286 -0.1532037 -0.6192220 0.1086547
RML -0.3185286 2.1271906 -0.3640053 0.5995860 0.1025134
FML -0.1532037 -0.3640053 1.7151420 0.6879402 0.2133205
TML -0.6192220 0.5995860 0.6879402 1.5581052 0.2910293
FHD 0.1086547 0.1025134 0.2133205 0.2910293 0.8965167
$`2`
HML RML FML TML FHD
HML 1.00082711 0.02754603 0.28260061 0.03354966 0.33759313
RML 0.02754603 0.84357799 -0.32547765 -0.24018033 -0.02572972
FML 0.28260061 -0.32547765 1.13751383 -0.22229766 -0.03210768
TML 0.03354966 -0.24018033 -0.22229766 0.29451851 0.06511133
FHD 0.33759313 -0.02572972 -0.03210768 0.06511133 0.24129973
$`3`
HML RML FML TML FHD
HML 0.95281484 -0.06072793 -0.18608614 0.08682029 -0.2241560
RML -0.06072793 0.94498926 0.05826306 0.26708137 -0.3414530
FML -0.18608614 0.05826306 0.68405108 0.02649167 0.2239994
TML 0.08682029 0.26708137 0.02649167 0.43268293 -0.2285798
FHD -0.22415605 -0.34145303 0.22399939 -0.22857978 0.7388127
$`4`
HML RML FML TML FHD
HML 0.8545644 0.2010881 -0.18227287 0.43631519 -0.31002651
RML 0.2010881 1.0269461 0.16025720 0.53798506 -0.20676662
FML -0.1822729 0.1602572 1.18667903 0.11072818 0.07572155
TML 0.4363152 0.5379851 0.11072818 0.99720184 -0.07098033
FHD -0.3100265 -0.2067666 0.07572155 -0.07098033 0.82212198
$`5`
HML RML FML TML FHD
HML 1.8208574 -0.73730457 -0.43570215 -0.13051837 0.30984815
RML -0.7373046 0.98584765 -0.06674645 0.10796767 -0.43046406
FML -0.4357021 -0.06674645 0.93344577 -0.00659714 -0.03835305
TML -0.1305184 0.10796767 -0.00659714 0.40967714 0.26304517
FHD 0.3098481 -0.43046406 -0.03835305 0.26304517 0.97107321