Variable

Date: 2017-09-04 03:18:58

Tags: r

I have a dataset like this:

test <-
    data.frame(
        variable    = c("A","A","B","B","C","D","E","E","E","F","F","G"), 
        confidence  = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),          
        freq        = c(2,2,2,2,1,1,3,3,3,2,2,1),
        weight      = c(2,2,0,0,1,3,5,5,5,0,0,4)
    )

> test
   variable confidence freq weight
1         A       1.00    2      2
2         A       0.60    2      2
3         B       0.10    2      0
4         B       0.15    2      0
5         C       1.00    1      1
6         D       0.30    1      3
7         E       0.40    3      5
8         E       0.50    3      5
9         E       0.20    3      5
10        F       1.00    2      0
11        F       0.40    2      0
12        G       0.90    1      4

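For each variable, I want to compute the sum of weight times confidence: $SWC_i = \sum_{j \in i} w_i c_j$, where i is the variable (A, B, C, ...), w_i is its weight, and c_j is the confidence of each of its observations j.

Expanding the formula above row by row: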

w[1]c[1]+w[1]c[2]=2*1+2*0.6=3.2
w[2]c[1]+w[2]c[2]
w[3]c[3]+w[3]c[4]
w[4]c[3]+w[4]c[4]
w[5]c[5]
w[6]c[6]
w[7]c[7]+w[7]c[8]+w[7]c[9]
w[8]c[7]+w[8]c[8]+w[8]c[9]
w[9]c[7]+w[9]c[8]+w[9]c[9]
…

The result should look like this:

> test
   variable confidence freq weight SWC
1         A       1.00    2      2 3.2
2         A       0.60    2      2 3.2
3         B       0.10    2      0 0.0
4         B       0.15    2      0 0.0
5         C       1.00    1      1 1.0
6         D       0.30    1      3 0.9
7         E       0.40    3      5 5.5
8         E       0.50    3      5 5.5
9         E       0.20    3      5 5.5
10        F       1.00    2      0 0.0
11        F       0.40    2      0 0.0
12        G       0.90    1      4 3.6

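Note that the confidence value differs for each observation, but all observations of a variable share the same weight, so the sum I need is the same for every observation of the same variable.

First, I tried to iterate over each variable using the counts from: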

> table(test$variable)

A B C D E F G 
2 2 1 1 3 2 1 

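But I couldn't get that to work. So I computed the position where each variable starts, so that the for loop would iterate only over those values: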

> tpos = cumsum(table(test$variable))
> tpos = tpos+1
> tpos
 A  B  C  D  E  F  G 
 3  5  6  7 10 12 13 
> tpos = shift(tpos, 1)
> tpos
[1] NA  3  5  6  7 10 12
> tpos[1]=1
> tpos
[1]  1  3  5  6  7 10 12

# tpos is a vector of the positions where each variable (A, B, C, ...) starts

> tposn = c(1:nrow(test))[-tpos]
> tposn
[1]  2  4  8  9 11
> c(1:nrow(test))[-tposn]
[1]  1  3  5  6  7 10 12

# Then I came up with this loop, but it doesn't give the correct result

for(i in 1:nrow(test)[-tposn]){
    a = test$freq[i]-1
    test$SWC[i:i+a] = sum(test$weight[i]*test$confidence[i:i+a])
    }

Maybe there is a simpler way? tapply?
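
One likely reason the loop misbehaves is operator precedence. In R, `[` binds tighter than `:`, so `1:nrow(test)[-tposn]` still runs over every row, and `:` binds tighter than `+`, so `i:i+a` means `(i:i)+a`, a single index rather than a range. A minimal sketch of a corrected loop follows. The names `starts` and `idx` are illustrative, and the loop assumes that rows of the same variable are contiguous and that `freq` holds the group size, as in the sample data.

```r
test <- data.frame(
  variable   = c("A","A","B","B","C","D","E","E","E","F","F","G"),
  confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
  freq       = c(2,2,2,2,1,1,3,3,3,2,2,1),
  weight     = c(2,2,0,0,1,3,5,5,5,0,0,4)
)

# Index of the first row of each variable (assumes contiguous groups)
starts <- which(!duplicated(test$variable))

# Create the column up front so that indexed assignment is well defined
test$SWC <- NA_real_

for (i in starts) {
  # Parentheses are required: i:i+a would be parsed as (i:i)+a
  idx <- i:(i + test$freq[i] - 1)
  test$SWC[idx] <- sum(test$weight[idx] * test$confidence[idx])
}
```

With the sample data, `test$SWC` should match the SWC column in the expected result above.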

1 answer:

Answer 0 (score: 3)

Using dplyr:

library(dplyr)

test %>% 
  group_by(variable) %>%
  mutate(SWC=sum(confidence*weight))

# A tibble: 12 x 5
# Groups:   variable [7]
   variable confidence  freq weight   SWC
     <fctr>      <dbl> <dbl>  <dbl> <dbl>
1        A       1.00     2      2   3.2
2        A       0.60     2      2   3.2
3        B       0.10     2      0   0.0
4        B       0.15     2      0   0.0
5        C       1.00     1      1   1.0
6        D       0.30     1      3   0.9
7        E       0.40     3      5   5.5
8        E       0.50     3      5   5.5
9        E       0.20     3      5   5.5
10       F       1.00     2      0   0.0
11       F       0.40     2      0   0.0
12       G       0.90     1      4   3.6
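
Since the question asks about tapply, here is a base R sketch of the same grouped sum, not part of the original answer. `ave()` returns one value per row, so its result can be assigned directly as a column. `tapply()` returns one value per group, so it needs to be expanded back to the rows by name. The names `swc_ave`, `sums` and `swc_tapply` are illustrative.

```r
test <- data.frame(
  variable   = c("A","A","B","B","C","D","E","E","E","F","F","G"),
  confidence = c(1,0.6,0.1,0.15,1,0.3,0.4,0.5,0.2,1,0.4,0.9),
  freq       = c(2,2,2,2,1,1,3,3,3,2,2,1),
  weight     = c(2,2,0,0,1,3,5,5,5,0,0,4)
)

# ave(): the grouped sum, already repeated for every row of its group
swc_ave <- ave(test$confidence * test$weight, test$variable, FUN = sum)

# tapply(): one sum per variable, named by variable ...
sums <- tapply(test$confidence * test$weight, test$variable, sum)
# ... then looked up by name to get one value per row
swc_tapply <- as.numeric(sums[as.character(test$variable)])

test$SWC <- swc_ave
```

Both vectors should agree with the SWC column produced by the dplyr version. `ave()` is usually the shorter choice when the result has to line up with the original rows.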