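I am having trouble performing a calculation that is defined iteratively. The following data serve as an example (the actual data set is much larger):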
## DATA ##
# Columns
Individual<-c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5","K1","K2","K3","K4","K5")
P1<-c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2<-c(0,0,"B",0,"D", "E",0,rep("G",5),"H1","H2","H3","H4","H5")
# Dataframe
myd<-data.frame(Individual,P1,P2,stringsAsFactors=FALSE)
Individual P1 P2
1 A 0 0
2 B 0 0
3 C A B
4 D 0 0
5 E C D
6 F C E
7 G 0 0
8 H1 E G
9 H2 E G
10 H3 E G
11 H4 E G
12 H5 E G
13 K1 H1 H1
14 K2 H2 H2
15 K3 H3 H3
16 K4 H4 H4
17 K5 H5 H5
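The data describe the relationship between each individual and its parents, P1 and P2.
The required calculation, labelled relationA, expresses how closely each individual is related to A.
By definition, the relationship between A and A has the value 1. The value for every other individual has to be calculated from the information in the table, as follows: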
The value of relationA for an individual should be equal to
1/2 (the value of relationA of P1 of the individual)
+ 1/2 (the value of relationA of P2 of the individual)
For example:
Individual P1 P2 relationA
1 A 0 0 1
2 B 0 0 0
3 C A B (A = 1 + B = 0)/2 = 0.5
4 D 0 0 0
5 E C D (C= 0.5 + D = 0)/2 = 0.25
6 F C E (C = 0.5 + E = 0.25)/2 = 0.375
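The arithmetic in the rows above can be reproduced by hand in R. This is only a sketch of the rule applied step by step, not a general solution; the vector name `rel` is an assumption introduced for illustration.

```r
# Founders: A is 1 by definition; B and D have no known parents, so 0.
rel <- c(A = 1, B = 0, D = 0)

# Each child is the mean of its two parents' values.
rel["C"] <- (rel[["A"]] + rel[["B"]]) / 2   # parents A and B
rel["E"] <- (rel[["C"]] + rel[["D"]]) / 2   # parents C and D
rel["F"] <- (rel[["C"]] + rel[["E"]]) / 2   # parents C and E

print(rel)
```

Each value can only be computed once both parent values are known, which is why the order of evaluation matters.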
The expected output is:
Individual P1 P2 relationA
1 A 0 0 1
2 B 0 0 0
3 C A B 0.5
4 D 0 0 0
5 E C D 0.25
6 F C E 0.375
7 G 0 0 0
8 H1 E G 0.125
9 H2 E G 0.125
10 H3 E G 0.125
11 H4 E G 0.125
12 H5 E G 0.125
13 K1 H1 H1 0.125
14 K2 H2 H2 0.125
15 K3 H3 H3 0.125
16 K4 H4 H4 0.125
17 K5 H5 H5 0.125
My difficulty is expressing this properly in R. Any help would be much appreciated.
Answer 0 (score: 4)
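You can write a function that computes the value for a given individual, expressing the relationship (implicitly) as a simple recursive function.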
relationA <- function(ind) {
  if (ind == "A") {
    1
  } else if (ind == "0") {
    0
  } else {
    pts <- myd[myd$Individual == ind, ]
    (relationA(pts[["P1"]]) + relationA(pts[["P2"]])) / 2
  }
}
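Put simply: if the individual is A, the value is 1; if the individual is 0, the value is 0; for anything else, call relationA recursively on each parent (P1 and P2) of that individual, add the two results, and divide by 2. This works for only one individual at a time: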
> relationA("A")
[1] 1
> relationA("F")
[1] 0.375
> relationA("K5")
[1] 0.125
But you can vectorize it over all individuals fairly easily:
> sapply(myd$Individual, relationA)
A B C D E F G H1 H2 H3 H4 H5 K1
1.000 0.000 0.500 0.000 0.250 0.375 0.000 0.125 0.125 0.125 0.125 0.125 0.125
K2 K3 K4 K5
0.125 0.125 0.125 0.125
The result can be assigned back to myd with:
myd$relationA <- sapply(myd$Individual, relationA)
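As a possible variation, `vapply` does the same job while stating that every call must return a single number, and `USE.NAMES = FALSE` avoids carrying the individuals' names into the new column. The sketch below repeats the example data and the recursive function so that it runs on its own.

```r
# Example data from the question
Individual <- c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5",
                "K1","K2","K3","K4","K5")
P1 <- c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2 <- c(0,0,"B",0,"D","E",0, rep("G",5),"H1","H2","H3","H4","H5")
myd <- data.frame(Individual, P1, P2, stringsAsFactors = FALSE)

# Recursive definition, as in the answer
relationA <- function(ind) {
  if (ind == "A") {
    1
  } else if (ind == "0") {
    0
  } else {
    pts <- myd[myd$Individual == ind, ]
    (relationA(pts[["P1"]]) + relationA(pts[["P2"]])) / 2
  }
}

# vapply fails loudly if any call returns something other than one number
myd$relationA <- vapply(myd$Individual, relationA, numeric(1),
                        USE.NAMES = FALSE)
print(myd)
```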
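This is not particularly efficient, because relationA has to be computed over and over again for each case. When it gets to "K5", it calls relationA("H5") twice; each of those calls relationA("E") and relationA("G"), which in turn call relationA("C"), relationA("D"), relationA("0") and relationA("0"), and so on. In other words, no results are cached; everything is recomputed every time. For this small data set that does not matter much, because even the inefficient version is still very fast.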
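To make the repeated work visible, the recursion can be instrumented with a call counter. This is a sketch only: `relationA_counted` and `n_calls` are names made up for the illustration, and the example data are repeated so the block runs on its own.

```r
# Example data from the question
Individual <- c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5",
                "K1","K2","K3","K4","K5")
P1 <- c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2 <- c(0,0,"B",0,"D","E",0, rep("G",5),"H1","H2","H3","H4","H5")
myd <- data.frame(Individual, P1, P2, stringsAsFactors = FALSE)

n_calls <- 0  # incremented on every call, including the base cases

relationA_counted <- function(ind) {
  n_calls <<- n_calls + 1
  if (ind == "A") return(1)
  if (ind == "0") return(0)
  pts <- myd[myd$Individual == ind, ]
  (relationA_counted(pts[["P1"]]) + relationA_counted(pts[["P2"]])) / 2
}

value_K5 <- relationA_counted("K5")
cat("value:", value_K5, " calls:", n_calls, "\n")
```

On the example data, the single query for "K5" takes 27 calls, although only eight distinct individuals (plus the "0" placeholder) are involved. The gap grows quickly as pedigrees get deeper.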
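If you want or need to cache the results and use that cache, you can modify relationA to do so:
relationAc <- function(ind) {
  pts <- myd[myd$Individual == ind, ]
  if (nrow(pts) == 0 || any(is.na(pts[["relationA"]]))) {
    # Not cached yet (or not a row of myd, e.g. "0"): compute the value
    relationA <-
      if (ind == "A") {
        1
      } else if (ind == "0") {
        0
      } else {
        (relationAc(pts[["P1"]]) + relationAc(pts[["P2"]])) / 2
      }
    # Store the result in the global data frame so later calls can reuse it
    myd[myd$Individual == ind, "relationA"] <<- relationA
    relationA
  } else {
    # Already cached
    pts[["relationA"]]
  }
}
Then you have to initialize the cache:
myd$relationA <- NA_real_
A single call fills in the values it needs, and calling the function over the whole set of individuals fills in all of them.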
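An alternative, sketched here under the assumption that you would rather not modify myd from inside the function with `<<-`, is to keep the cache in a separate environment. The names `cache` and `relationA_memo` are hypothetical; the example data are repeated so the block runs on its own.

```r
# Example data from the question
Individual <- c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5",
                "K1","K2","K3","K4","K5")
P1 <- c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2 <- c(0,0,"B",0,"D","E",0, rep("G",5),"H1","H2","H3","H4","H5")
myd <- data.frame(Individual, P1, P2, stringsAsFactors = FALSE)

cache <- new.env()  # maps individual name -> computed value

relationA_memo <- function(ind) {
  if (ind == "A") return(1)
  if (ind == "0") return(0)
  # Reuse a stored value if this individual was already computed
  if (exists(ind, envir = cache, inherits = FALSE)) {
    return(get(ind, envir = cache, inherits = FALSE))
  }
  pts <- myd[myd$Individual == ind, ]
  val <- (relationA_memo(pts[["P1"]]) + relationA_memo(pts[["P2"]])) / 2
  assign(ind, val, envir = cache)
  val
}

myd$relationA <- vapply(myd$Individual, relationA_memo, numeric(1),
                        USE.NAMES = FALSE)
print(myd)
```

Because the cache lives outside the data frame, no initialization of a relationA column is needed before the first call.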
Answer 1 (score: 3)
More concisely, you can use sapply and rowSums to turn the for-loop into a single line of code:
# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1
# Calculate relationA
myd$relationA <- myd$relationA + rowSums(sapply(myd$Individual, function(indiv)
myd$relationA[myd$Individual==indiv]/2 * ((myd$P1==indiv) + (myd$P2==indiv))))
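One caveat worth checking on your own data: a single vectorized pass reads only the initial values of relationA, so on the example data it propagates A's contribution to A's direct children only (here, C). A possible way to keep the computation vectorized is to repeat a parent lookup until the values stop changing. The sketch below is one such variant, not the method from the answer; the names `rel`, `i1`, `i2` and `is_A` are assumptions, and it assumes the pedigree has no cycles.

```r
# Example data from the question
Individual <- c("A","B","C","D","E","F","G","H1","H2","H3","H4","H5",
                "K1","K2","K3","K4","K5")
P1 <- c(0,0,"A",0,"C","C",0, rep("E",5),"H1","H2","H3","H4","H5")
P2 <- c(0,0,"B",0,"D","E",0, rep("G",5),"H1","H2","H3","H4","H5")
myd <- data.frame(Individual, P1, P2, stringsAsFactors = FALSE)

is_A <- myd$Individual == "A"
rel  <- as.numeric(is_A)            # start: A is 1, everyone else 0

# Row index of each parent; "0" (no known parent) has no match and gives NA
i1 <- match(myd$P1, myd$Individual)
i2 <- match(myd$P2, myd$Individual)

# An acyclic pedigree with n rows needs at most n passes
for (pass in seq_len(nrow(myd))) {
  p1 <- ifelse(is.na(i1), 0, rel[i1])   # unknown parents contribute 0
  p2 <- ifelse(is.na(i2), 0, rel[i2])
  new_rel <- ifelse(is_A, 1, (p1 + p2) / 2)
  if (identical(new_rel, rel)) break    # nothing changed: done
  rel <- new_rel
}

myd$relationA <- rel
print(myd)
```

Each pass pushes the values one generation further down, so the number of passes is tied to the depth of the pedigree rather than to the order of the rows.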
Are you looking for something like this?
# Initialize values of relationA
myd$relationA <- 0
myd$relationA[myd$Individual=="A"] <- 1
# Iterate over all Individuals
for (indiv in myd$Individual) {
  indiVal <- myd$relationA[myd$Individual==indiv]
  # all columns handled at once, thanks to vectorization; no need for myd$P1[i]
  myd$relationA <- myd$relationA +
    indiVal/2 * ((myd$P1==indiv) + (myd$P2==indiv))
}
Output:
myd
Individual P1 P2 relationA
1 A 0 0 1.000
2 B 0 0 0.000
3 C A B 0.500
4 D 0 0 0.000
5 E C D 0.250
6 F C E 0.375
7 G 0 0 0.000
8 H1 E G 0.125
9 H2 E G 0.125
...