Given the following data,
text = "
name,param1,param2,param3,param4,param5
A,1,a,false,,64ms
B,1,a,false,,32ms
C,1,b,false,,128ms
D,1,a,true,,32ms
E,1,b,false,,128ms
"
df = read.table(textConnection(text), sep=",", header = T)
I am trying to find the most common value in each of the columns param1 through param5. For a single column, the table function does this, as shown below.
> table(df$param5)/nrow(df)
128ms  32ms  64ms
  0.4   0.4   0.2
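While this is useful for checking one column at a time, what I really want is to process all columns at once. How can I do that?
The expected output is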
+-----------------------------+--------+--------+--------+--------+--------+
| | param1 | param2 | param3 | param4 | param5 |
+-----------------------------+--------+--------+--------+--------+--------+
| most_common_value | 1 | a | false | NA | 32ms |
| ratio_of_most_common_value | 1 | 0.6 | 0.8 | 1.0 | 0.4 |
| least_common_value | 1 | b | true | NA | 64ms |
| ratio_of_least_common_value | 1 | 0.4 | 0.2 | 1.0 | 0.2 |
| unique_values | 1 | 2 | 2 | 1 | 3 |
+-----------------------------+--------+--------+--------+--------+--------+
Answer 0 (score: 1)
Here is one way to do it:
n <- nrow(df)
sapply(
  df[grep("param", names(df))],  # keep only the param* columns
  function(x) {
    # frequency table (NA counted when present), most frequent value first
    ourt <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
    nt <- length(ourt)  # number of distinct values in the column
    c(
      most_common_value = names(ourt)[1],
      ratio_of_most_common_value = ourt[1] / n,
      least_common_value = names(ourt)[nt],
      ratio_of_least_common_value = ourt[nt] / n,
      unique_values = nt
    )
  }
)
                              param1 param2 param3  param4 param5
most_common_value             "1"    "a"    "false" NA     "128ms"
ratio_of_most_common_value.1  "1"    "0.6"  "0.8"   "1"    "0.4"
least_common_value            "1"    "b"    "true"  NA     "64ms"
ratio_of_least_common_value.1 "1"    "0.4"  "0.2"   "1"    "0.2"
unique_values                 "1"    "2"    "2"     "1"    "3"
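In the result above every cell is a string, because c() coerces mixed values to character, and the ratio rows pick up a ".1" suffix from the name of the table cell. A minimal sketch of a variant that keeps numeric ratios follows. It assumes the columns are rebuilt by hand as character vectors (so it does not depend on how read.table converts the data), summarise_column is a hypothetical helper name, and the result has one row per parameter rather than one column.

```r
# Assumption: the data is rebuilt by hand so the sketch is self-contained.
df <- data.frame(
  name   = c("A", "B", "C", "D", "E"),
  param1 = c(1L, 1L, 1L, 1L, 1L),
  param2 = c("a", "a", "b", "a", "b"),
  param3 = c("false", "false", "false", "true", "false"),
  param4 = NA,
  param5 = c("64ms", "32ms", "128ms", "32ms", "128ms"),
  stringsAsFactors = FALSE
)

# summarise_column is a hypothetical helper, not part of the answer above.
summarise_column <- function(x) {
  # frequency table (NA counted when present), most frequent value first
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  k <- length(counts)
  data.frame(
    most_common_value           = names(counts)[1],
    ratio_of_most_common_value  = counts[[1]] / length(x),  # [[ ]] drops the table class
    least_common_value          = names(counts)[k],
    ratio_of_least_common_value = counts[[k]] / length(x),
    unique_values               = k,
    stringsAsFactors = FALSE
  )
}

# One summary row per param* column; row names are the column names.
res <- do.call(rbind, lapply(df[grep("param", names(df))], summarise_column))
print(res)
```

Calling t(res) would give the layout from the question, at the cost of turning every value back into a string.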
Answer 1 (score: 0)
You can use the function frq from the package sjmisc.
> library(sjmisc)
> frq(df,param1:param5)
param1 <integer>
# total N=5 valid N=5 mean=1.00 sd=0.00
val frq raw.prc valid.prc cum.prc
1 5 100 100 100
NA 0 0 NA NA
param2 <categorical>
# total N=5 valid N=5 mean=1.40 sd=0.55
val frq raw.prc valid.prc cum.prc
a 3 60 60 60
b 2 40 40 100
<NA> 0 0 NA NA
param3 <categorical>
# total N=5 valid N=5 mean=1.20 sd=0.45
val frq raw.prc valid.prc cum.prc
false 4 80 80 80
true 1 20 20 100
<NA> 0 0 NA NA
param4 <lgl>
# total N=5 valid N=0 mean=NaN sd=NA
val frq raw.prc valid.prc cum.prc
NA 5 100 NA NA
param5 <categorical>
# total N=5 valid N=5 mean=1.80 sd=0.84
val frq raw.prc valid.prc cum.prc
128ms 2 40 40 40
32ms 2 40 40 80
64ms 1 20 20 100
<NA> 0 0 NA NA
The output is not in the format you expected, but the values you want can be derived from it.
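As a rough illustration of that derivation, the sketch below works on a hand-built data frame that copies the param5 table printed above (rows with a zero count omitted). The data frame freq and the list summary_param5 are hypothetical names, and this is an assumption about the table's shape, not the object that frq actually returns.

```r
# Hand-typed copy of the param5 frequency table shown above (an assumption,
# not the real return value of frq()).
freq <- data.frame(
  val     = c("128ms", "32ms", "64ms"),
  frq     = c(2L, 2L, 1L),
  raw.prc = c(40, 40, 20),
  stringsAsFactors = FALSE
)

top    <- which.max(freq$frq)  # first row with the highest count
bottom <- which.min(freq$frq)  # first row with the lowest count

# A list keeps the ratios numeric instead of coercing them to strings.
summary_param5 <- list(
  most_common_value           = freq$val[top],
  ratio_of_most_common_value  = freq$raw.prc[top] / 100,
  least_common_value          = freq$val[bottom],
  ratio_of_least_common_value = freq$raw.prc[bottom] / 100,
  unique_values               = nrow(freq)
)
str(summary_param5)
```

For a column such as param4, where every value is missing, the NA row would have to be kept in the table instead of being dropped.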
Answer 2 (score: 0)
If you are using base R, the following code may help.
# Relative frequency of each value (NA included) in every column except name,
# sorted so that the most frequent value comes first
lst <- lapply(df[-1], function(v) sort(table(v, exclude = NULL), decreasing = TRUE) / nrow(df))
l <- lapply(names(lst), function(k) {
  setNames(data.frame(head(names(lst[[k]]), 1),
                      as.numeric(head(lst[[k]], 1)),
                      tail(names(lst[[k]]), 1),
                      as.numeric(tail(lst[[k]], 1)),
                      length(lst[[k]])),  # number of distinct values, not distinct ratios
           c("most_common_value", "ratio_of_most_common_value", "least_common_value", "ratio_of_least_common_value", "unique_values"))
})
# Stack the one-row summaries, then transpose so each column is a parameter
res <- setNames(data.frame(t(Reduce(rbind, l))), names(lst))
which gives
> res
                            param1 param2 param3 param4 param5
most_common_value                1      a  false   <NA>  128ms
ratio_of_most_common_value     1.0    0.6    0.8    1.0    0.4
least_common_value               1      b   true   <NA>   64ms
ratio_of_least_common_value    1.0    0.4    0.2    1.0    0.2
unique_values                    1      2      2      1      3