R - Find the most/least frequent values and their ratios for the columns of a data frame

Date: 2019-11-26 11:43:23

Tags: r dataframe

Given the following data,

text = "
name,param1,param2,param3,param4,param5
A,1,a,false,,64ms
B,1,a,false,,32ms
C,1,b,false,,128ms
D,1,a,true,,32ms
E,1,b,false,,128ms
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)

I am trying to find the most common value for each of the columns param1 to param5. For a single column, the table function can be used, as shown below.

> table(df$param5)/nrow(df)

128ms  32ms  64ms 
 0.4    0.4    0.2 

While this is useful for checking one column at a time, what I really want is to do this for all columns at once. How can I do that?

The expected output is

+-----------------------------+--------+--------+--------+--------+--------+
|                             | param1 | param2 | param3 | param4 | param5 |
+-----------------------------+--------+--------+--------+--------+--------+
| most_common_value           |      1 | a      | false  | NA     | 32ms   |
| ratio_of_most_common_value  |      1 | 0.6    | 0.8    | 1.0    | 0.4    |
| least_common_value          |      1 | b      | true   | NA     | 64ms   |
| ratio_of_least_common_value |      1 | 0.4    | 0.2    | 1.0    | 0.2    |
| unique_values               |      1 | 2      | 2      | 1      | 3      |
+-----------------------------+--------+--------+--------+--------+--------+

3 answers:

Answer 0 (score: 1)

Here is one way to do it:

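n <- nrow(df)  # total number of rows, used as the denominator for the ratios

sapply(
  df[grep("param", names(df))],  # only the param* columns
  function(x) {
    # frequency table including NA, most frequent value first
    ourt <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
    nt <- length(ourt)  # number of distinct values
    c(
      most_common_value = names(ourt)[1],
      ratio_of_most_common_value = ourt[1] / n,
      least_common_value = names(ourt)[nt],
      ratio_of_least_common_value = ourt[nt] / n,
      unique_values = nt
    )
  }
)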


                              param1 param2 param3  param4 param5 
most_common_value             "1"    "a"    "false" NA     "128ms"
ratio_of_most_common_value.1  "1"    "0.6"  "0.8"   "1"    "0.4"  
least_common_value            "1"    "b"    "true"  NA     "64ms" 
ratio_of_least_common_value.1 "1"    "0.4"  "0.2"   "1"    "0.2"  
unique_values                 "1"    "2"    "2"     "1"    "3"    
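
The `.1` suffix on the two ratio rows appears because `ourt[1]` and `ourt[nt]` keep their own element name (here the value `1` from `param1`), and `c()` appends it to the name supplied. A minimal sketch of one way to get clean row names, using `[[` to extract the counts without names. The data frame `df_demo` and the helper `summarise_col` are hypothetical names introduced for illustration, and the columns are typed in by hand as character vectors rather than read from the text above:

```r
# Hypothetical stand-in for part of the question's data, typed in by hand.
df_demo <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  param2 = c("a", "a", "b", "a", "b"),
  param5 = c("64ms", "32ms", "128ms", "32ms", "128ms"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise a single column.
summarise_col <- function(x) {
  # frequency table including NA, most frequent value first
  tab <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  nt <- length(tab)
  c(
    most_common_value           = names(tab)[1],
    ratio_of_most_common_value  = tab[[1]] / length(x),   # [[ drops the element name
    least_common_value          = names(tab)[nt],
    ratio_of_least_common_value = tab[[nt]] / length(x),
    unique_values               = nt
  )
}

# Apply the helper to every column whose name contains "param".
res_demo <- sapply(df_demo[grep("param", names(df_demo))], summarise_col)
res_demo
```

Because the values and the ratios share one vector, everything is still returned as character strings, as in the output above.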

Answer 1 (score: 0)

You can use the function frq from the package sjmisc

> library(sjmisc)
> frq(df,param1:param5)

param1 <integer>
# total N=5  valid N=5  mean=1.00  sd=0.00

 val frq raw.prc valid.prc cum.prc
   1   5     100       100     100
  NA   0       0        NA      NA


param2 <categorical>
# total N=5  valid N=5  mean=1.40  sd=0.55

  val frq raw.prc valid.prc cum.prc
    a   3      60        60      60
    b   2      40        40     100
 <NA>   0       0        NA      NA


param3 <categorical>
# total N=5  valid N=5  mean=1.20  sd=0.45

   val frq raw.prc valid.prc cum.prc
 false   4      80        80      80
  true   1      20        20     100
  <NA>   0       0        NA      NA


param4 <lgl>
# total N=5  valid N=0  mean=NaN  sd=NA

 val frq raw.prc valid.prc cum.prc
  NA   5     100        NA      NA


param5 <categorical>
# total N=5  valid N=5  mean=1.80  sd=0.84

   val frq raw.prc valid.prc cum.prc
 128ms   2      40        40      40
  32ms   2      40        40      80
  64ms   1      20        20     100
  <NA>   0       0        NA      NA

The output is not in the format you expect, but the values you are after can be derived from it.
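
A rough sketch of such a derivation, assuming the frequency table of one column is available as a data frame with the columns `val`, `frq` and `raw.prc` seen in the printout. The exact structure returned by `frq` is not confirmed here, so the table is typed in by hand; `freq_param3` and `derive_summary` are hypothetical names:

```r
# Hand-typed copy of the param3 table printed above (the NA row with frq 0 is omitted).
freq_param3 <- data.frame(
  val     = c("false", "true"),
  frq     = c(4, 1),
  raw.prc = c(80, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the requested values out of one frequency table.
derive_summary <- function(freq) {
  freq  <- freq[freq$frq > 0, ]    # ignore values that never occur
  most  <- which.max(freq$frq)     # first row with the highest count
  least <- which.min(freq$frq)     # first row with the lowest count
  list(
    most_common_value           = freq$val[most],
    ratio_of_most_common_value  = freq$raw.prc[most] / 100,   # percent to ratio
    least_common_value          = freq$val[least],
    ratio_of_least_common_value = freq$raw.prc[least] / 100,
    unique_values               = nrow(freq)
  )
}

summary_param3 <- derive_summary(freq_param3)
str(summary_param3)
```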

Answer 2 (score: 0)

If you are using base R, the following code may help.

# Relative frequency of every value per column (NA counted as its own value),
# sorted so the most common value comes first; [-1] drops the "name" column.
lst <- lapply(df, function(v) sort(table(v, exclude = NULL) / nrow(df), decreasing = TRUE))[-1]

l <- lapply(names(lst), function(k) {
  setNames(data.frame(head(names(lst[[k]]), 1),
             as.numeric(head(lst[[k]], 1)),
             tail(names(lst[[k]]), 1),
             as.numeric(tail(lst[[k]], 1)),
             length(lst[[k]])),  # number of distinct values, not of distinct ratios
           c("most_common_value","ratio_of_most_common_value","least_common_value","ratio_of_least_common_value","unique_values"))
})
res <- setNames(data.frame(t(Reduce(rbind, l))), names(lst))

which gives

> res
                            param1 param2 param3 param4 param5
most_common_value                1      a  false   <NA>  128ms
ratio_of_most_common_value     1.0    0.6    0.8    1.0    0.4
least_common_value               1      b   true   <NA>   64ms
ratio_of_least_common_value    1.0    0.4    0.2    1.0    0.2
unique_values                    1      2      2      1      3
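
One caveat applies to all of these approaches: in `param5`, `128ms` and `32ms` both occur twice, so which of them is reported as the most common value depends only on how ties are ordered (the expected output in the question lists `32ms`, while the answers report `128ms`). A minimal sketch that returns every tied value instead of picking one; `all_modes` is a hypothetical helper name and the vector is typed in by hand:

```r
# Hypothetical helper: return all values that share the highest count.
all_modes <- function(x) {
  tab <- table(x, useNA = "ifany")
  names(tab)[as.vector(tab) == max(tab)]
}

# Hand-typed copy of the param5 column from the question.
param5 <- c("64ms", "32ms", "128ms", "32ms", "128ms")
modes_param5 <- all_modes(param5)
modes_param5
```

The same idea with `min(tab)` would list all least common values.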