Finding the ratio between two categorical variables

Time: 2018-03-30 19:42:57

Tags: r plot ggplot2 dplyr

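I have a data frame with Rank, Status and a count, created by aggregating a parent data frame. I want to find the proportion/percentage as shown below.

That is, what is the percentage/ratio of Incomplete to Complete, or Incomplete out of the total, for each rank?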

Rank Status `n()`
   <fct> <fct>       <int> <ratio>
 1 A     Incomplete   602  
 2 A     Complete   9443    602/9443
 3 B     Incomplete  1425
 4 B     Complete  10250    ----
 5 C     Incomplete  1347   ----
 6 C     Complete   6487
 7 D     Incomplete  1118
 8 D     Complete   3967
 9 E     Incomplete   715
10 E     Complete   1948

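I tried iterating with sapply(), computing the ratio and storing it in another df. Is there a better way?

Otherwise, it would be great if a stacked bar chart could be labeled with the percentage/ratio above.

The stacked bar chart I tried shows the percentage of the grand total rather than the ratio.

Thanks.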

4 answers:

Answer 0: (score: 2)

Using dplyr

library(dplyr)

df <- data.frame(Rank = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
             Status = c("Incomplete", "Complete","Incomplete", "Complete",
                        "Incomplete", "Complete","Incomplete", "Complete",
                        "Incomplete", "Complete"),
             Count = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948))

# Ratio
df %>% group_by(Rank) %>% mutate(Ratio = Count/sum(Count))
# A tibble: 10 x 4
# Groups:   Rank [5]
#   Rank  Status      Count  Ratio
#   <fct> <fct>       <dbl>  <dbl>
# 1 A     Incomplete   602. 0.0599
# 2 A     Complete    9443. 0.940 
# 3 B     Incomplete  1425. 0.122 
# 4 B     Complete   10250. 0.878 
# 5 C     Incomplete  1347. 0.172 
# 6 C     Complete    6487. 0.828 
# 7 D     Incomplete  1118. 0.220 
# 8 D     Complete    3967. 0.780 
# 9 E     Incomplete   715. 0.268 
#10 E     Complete    1948. 0.732 

# Percentage
df %>% group_by(Rank) %>% mutate(Percentage = (Count/sum(Count))*100)
# A tibble: 10 x 4
# Groups:   Rank [5]
#   Rank  Status      Count Percentage
#   <fct> <fct>       <dbl>      <dbl>
# 1 A     Incomplete   602.       5.99
# 2 A     Complete    9443.       94.0 
# 3 B     Incomplete  1425.       12.2 
# 4 B     Complete   10250.       87.8 
# 5 C     Incomplete  1347.       17.2 
# 6 C     Complete    6487.       82.8 
# 7 D     Incomplete  1118.       22.0 
# 8 D     Complete    3967.       78.0 
# 9 E     Incomplete   715.       26.8 
#10 E     Complete    1948.       73.2 
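
The two results above give each status as a share of its rank's total. If what you want is the ratio of Incomplete to Complete (602/9443 for rank A, as in the question), here is a minimal sketch, assuming the same df with a Count column. The names ratio_by_rank, Incomplete, Complete and Ratio are only illustrative.

```r
library(dplyr)

# Same sample data as above
df <- data.frame(
  Rank   = rep(c("A", "B", "C", "D", "E"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 5),
  Count  = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948)
)

# One row per rank: total of each status, then Incomplete / Complete
ratio_by_rank <- df %>%
  group_by(Rank) %>%
  summarise(
    Incomplete = sum(Count[Status == "Incomplete"]),
    Complete   = sum(Count[Status == "Complete"]),
    Ratio      = Incomplete / Complete
  )
ratio_by_rank
```

Summing inside summarise() means a rank with several rows for the same status is still handled.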

Answer 1: (score: 1)

Using dcast from data.table

Code:

library('data.table')
dcast(setDT(df), formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]

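If a rank has duplicate statuses, e.g. rank A has two Incomplete rows with counts 602 and 605, the following handles it by summing the counts first.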

dcast(setDT(df2)[, .(count = sum(count)), by = .(Rank, Status)],  # sum count by Status and Rank
      formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]

Output:

Without duplicate statuses

#    Rank Complete Incomplete      ratio
# 1:    A     9443        602 0.06375093
# 2:    B    10250       1425 0.13902439
# 3:    C     6487       1347 0.20764606
# 4:    D     3967       1118 0.28182506
# 5:    E     1948        715 0.36704312

With duplicate statuses

#    Rank Complete Incomplete     ratio
# 1:    A     9443       1207 0.1278195
# 2:    B    10250       1425 0.1390244
# 3:    C     6487       1347 0.2076461
# 4:    D     3967       1118 0.2818251
# 5:    E     1948        715 0.3670431

Data:

Without duplicate statuses

df <- read.table(text='Rank Status `n()`
                 1 A     Incomplete   602  
                 2 A     Complete   9443
                 3 B     Incomplete  1425
                 4 B     Complete  10250
                 5 C     Incomplete  1347
                 6 C     Complete   6487
                 7 D     Incomplete  1118
                 8 D     Complete   3967
                 9 E     Incomplete   715
                 10 E     Complete   1948')
colnames(df)[3] <- 'count'

With duplicate statuses:

df2 <- read.table(text='Rank Status `n()`
                 1 A     Incomplete   602  
                 2 A     Incomplete   605
                 2.1 A     Complete   9443
                 3 B     Incomplete  1425
                 4 B     Complete  10250
                 5 C     Incomplete  1347
                 6 C     Complete   6487
                 7 D     Incomplete  1118
                 8 D     Complete   3967
                 9 E     Incomplete   715
                 10 E     Complete   1948')
colnames(df2)[3] <- 'count'

Answer 2: (score: 0)

I haven't used the dplyr package, but I think the following logic would work. Suppose your data frame is df.

# Sample data like yours
p <- c("Incomplete","Complete","Incomplete","Complete","Incomplete","Complete")
q <- c(602,9443,1425,10250,1347,6487)

# Ranks are ignored here; rows are assumed to alternate Incomplete, Complete
df <- data.frame("Status" = p,"counts" = q)

# Start with a vector of zeros, one entry per row
ratiovector <- numeric(NROW(df))
kcomp <- which(df$Status == "Complete")
kincomp <- which(df$Status == "Incomplete")
# Store Incomplete/Complete on each Complete row (pairs are matched by position)
ratiovector[kcomp] <- df$counts[kincomp]/df$counts[kcomp]
dfnew <- cbind(df,"ratio" = ratiovector)
# Print dfnew
dfnew
# Convert the ratio to a string if you need it in that form.
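
The index-based pairing above only works if every Incomplete row lines up with its Complete row. If the Rank column is available, a possible alternative is to pair the counts by rank explicitly with tapply(). This is a sketch assuming the question's full data; the names wide and ratio are only illustrative.

```r
# Question's data, with the Rank column kept
df <- data.frame(
  Rank   = rep(c("A", "B", "C", "D", "E"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 5),
  counts = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948)
)

# Rank x Status matrix of summed counts; row order in df no longer matters
wide <- tapply(df$counts, list(df$Rank, df$Status), sum)

# Named vector: one Incomplete/Complete ratio per rank
ratio <- wide[, "Incomplete"] / wide[, "Complete"]
ratio
```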

Answer 3: (score: 0)

In base R:

df$ratio <- ave(df$Count,df$Rank,FUN=function(x)x/sum(x))
#    Rank     Status Count      ratio
# 1     A Incomplete   602 0.05993031
# 2     A   Complete  9443 0.94006969
# 3     B Incomplete  1425 0.12205567
# 4     B   Complete 10250 0.87794433
# 5     C Incomplete  1347 0.17194281
# 6     C   Complete  6487 0.82805719
# 7     D Incomplete  1118 0.21986234
# 8     D   Complete  3967 0.78013766
# 9     E Incomplete   715 0.26849418
# 10    E   Complete  1948 0.73150582
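
The question also asks about labeling a stacked bar chart with the per-rank percentage. A minimal sketch with ggplot2, reusing the ave() idea above. The column name Pct and the label format are my own choices, not something from the question.

```r
library(ggplot2)

df <- data.frame(
  Rank   = rep(c("A", "B", "C", "D", "E"), each = 2),
  Status = rep(c("Incomplete", "Complete"), times = 5),
  Count  = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948)
)

# Share of each status within its rank, in percent
df$Pct <- ave(df$Count, df$Rank, FUN = function(x) 100 * x / sum(x))

# Stacked bars of raw counts, each segment labeled with its within-rank percentage
p <- ggplot(df, aes(x = Rank, y = Count, fill = Status)) +
  geom_col() +
  geom_text(aes(label = sprintf("%.1f%%", Pct)),
            position = position_stack(vjust = 0.5))
p
```

position_stack(vjust = 0.5) places each label in the middle of its segment. To show the Incomplete/Complete ratio instead, compute that value into the label column.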