Manually calculating the variance of count data from categorical ratings

Time: 2017-08-02 14:13:29

Tags: r

I am trying to manually calculate the variance (and the mean) from count data for categorical ratings.

Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)

Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)

Data

  Item Never Rarely Occasionally Sometimes Frequently Usually Always
1    A     4     NA           17        10          3       2      7
2    B    12     10            5        12         21      14     NA
3    C    17     20           12        17         NA      12     18
4    D    NA     15            6        NA         16      20     23

Each categorical rating has an equivalent numeric value (1:7). I have calculated the average numeric rating for each item as follows:

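Rating_wt <- 1:7 # Vector of weights for each frequency rating (Never = 1 ... Always = 7)
# Repeat each weight once per row so the vector lines up column by column with Data[,2:8]
Rating.wt.mat <- rep(Rating_wt, each=nrow(Data))
# Weighted mean per item: sum(count * rating) / sum(count), ignoring NA counts
Data$Avg_rating <- rowSums(Data[,2:8]*Rating.wt.mat,na.rm=TRUE)/rowSums(Data[,2:8],na.rm=TRUE)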

Data

  Item Never Rarely Occasionally Sometimes Frequently Usually Always Avg_rating
1    A     4     NA           17        10          3       2      7   3.976744
2    B    12     10            5        12         21      14     NA   3.837838
3    C    17     20           12        17         NA      12     18   3.739583
4    D    NA     15            6        NA         16      20     23   5.112500

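I also want to calculate the variance around each mean and store it as a new variable in the data.

I think I need to subtract each item's mean from each numeric rating, square the difference, multiply it by the count in the corresponding cell, sum these results across each row, and then divide by the total count in that row.

However, I can't figure out how to set up the element-wise calculation to achieve this.

Conceptually, I think it should look something like this:

Data$Rating_var <- rowSums((Numeric_Rating - Avg_rating)^2 * Value, na.rm=TRUE) / rowSums(Data[,2:8], na.rm=TRUE)

Numeric_Rating corresponds to Rating_wt: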

Never = 1
Rarely = 2
Occasionally = 3
Sometimes = 4
Frequently = 5
Usually = 6
Always = 7

Value is the count in the cell where each Numeric_Rating and Item intersect.

1 answer:

Answer 0 (score: 1)

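I suggest reshaping the dataset to long form before applying the calculation, because that makes the per-item computation much easier.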

library(dplyr)
library(tidyr)


Item <- c("A", "B", "C", "D")
cat1 <- c(4,12,17,NA)
cat2 <- c(NA,10,20,15)
cat3 <- c(17,5,12,6)
cat4 <- c(10,12,17,NA)
cat5 <- c(3,21,NA,16)
cat6 <- c(2,14,12,20)
cat7 <- c(7,NA,18,23)

Data <- data.frame(Item=Item, Never=cat1,Rarely=cat2,Occasionally=cat3, Sometimes=cat4,Frequently=cat5,Usually=cat6,Always=cat7,stringsAsFactors=FALSE)


Data %>%
  gather(category, value, -Item) %>%                                          # reshape to long form: one row per Item/category
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Occasionally" = 3,
                                   "Sometimes" = 4, "Frequently" = 5,
                                   "Usually" = 6, "Always" = 7)) %>%          # assign the numeric rating
  group_by(Item) %>%                                                          # for each item
  mutate(Avg = sum(Rating * value, na.rm = TRUE) / sum(value, na.rm = TRUE),  # count-weighted mean rating
         # variance: count-weighted mean of squared deviations from Avg (divides by the total count)
         variance = sum((Rating - Avg)^2 * value, na.rm = TRUE) / sum(value, na.rm = TRUE)) %>%
  ungroup() %>%                                                               # drop the grouping
  select(-Rating) %>%                                                         # the rating is no longer needed
  spread(category, value) %>%                                                 # reshape back to the original wide form
  select(all_of(names(Data)), Avg, variance)                                  # put columns in the desired order


# # A tibble: 4 x 10
#    Item Never Rarely Occasionally Sometimes Frequently Usually Always      Avg variance
# * <chr> <dbl>  <dbl>        <dbl>     <dbl>      <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
# 1     A     4     NA           17        10          3       2      7 3.976744 2.952948
# 2     B    12     10            5        12         21      14     NA 3.837838 3.081812
# 3     C    17     20           12        17         NA      12     18 3.739583 4.671766
# 4     D    NA     15            6        NA         16      20     23 5.112500 3.374844
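
As a cross-check, the same per-item mean and variance can be sketched in base R without reshaping. This is only an illustration under stated assumptions: the variance is taken as the count-weighted mean of squared deviations (dividing by the total count, not by the count minus one), missing counts are treated as zero responses, and the names counts, counts0, dev2 and rating_var are made up for this example.

```r
# Counts per item (rows) and rating category (columns), copied from the question.
counts <- rbind(
  A = c(4, NA, 17, 10, 3, 2, 7),
  B = c(12, 10, 5, 12, 21, 14, NA),
  C = c(17, 20, 12, 17, NA, 12, 18),
  D = c(NA, 15, 6, NA, 16, 20, 23)
)
colnames(counts) <- c("Never", "Rarely", "Occasionally", "Sometimes",
                      "Frequently", "Usually", "Always")

ratings <- 1:7                  # numeric value of each category
counts0 <- counts
counts0[is.na(counts0)] <- 0    # assumption: a missing count means no responses

n   <- rowSums(counts0)                   # total responses per item
avg <- drop(counts0 %*% ratings) / n      # count-weighted mean rating per item

# Squared deviation of every rating from each item's mean (items x ratings matrix)
dev2 <- outer(avg, ratings, function(m, r) (r - m)^2)

# Count-weighted mean of the squared deviations, one value per item
rating_var <- rowSums(counts0 * dev2) / n

round(cbind(n, avg, rating_var), 6)
```

Replacing NA with 0 has the same effect here as na.rm = TRUE in the sums, and outer() builds the full grid of deviations in one step so no loop over items is needed.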

Try running the pipeline one step at a time to see how it works, especially if you are not familiar with dplyr and tidyr syntax.
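
One way to inspect the steps is to store each intermediate result and print it. The sketch below uses a small stand-in data frame (Small, a hypothetical name, with only three of the seven categories) so that the printed tables stay short; the same approach should apply to Data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in with the same layout as Data: two items, three categories.
Small <- data.frame(Item = c("A", "B"),
                    Never = c(4, 12), Rarely = c(NA, 10), Always = c(7, NA),
                    stringsAsFactors = FALSE)

# Step 1: reshape to long form, one row per Item/category pair.
step1 <- Small %>% gather(category, value, -Item)
print(step1)

# Step 2: attach the numeric rating that belongs to each category.
step2 <- step1 %>%
  mutate(Rating = recode(category, "Never" = 1, "Rarely" = 2, "Always" = 7))
print(step2)

# Step 3: per-item total count and count-weighted mean rating.
step3 <- step2 %>%
  group_by(Item) %>%
  summarise(n = sum(value, na.rm = TRUE),
            Avg = sum(Rating * value, na.rm = TRUE) / n)
print(step3)
```

Here summarise() is used instead of mutate() so that each item collapses to a single row, which makes the grouped result easier to read while exploring.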