[I'm new to R...] I have a data frame:
df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names
I would like a summary of df1, returned as a data frame object, that looks like this:
          count      mean     std       min       25%       50%       75%       max
age     30.0000    3.5000  1.7370    1.0000    2.0000    3.5000    5.0000    6.0000
gender  30.0000    1.6667  0.4795    1.0000    1.0000    2.0000    2.0000    2.0000
height  30.0000  155.5000  8.8034  141.0000  148.2500  155.5000  162.7500  170.0000
I generated this in Python with df1.describe().T. How can I do the same in R?
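It would be even better if the summary data frame also included "dtype", "null" (the number of NULL values), "unique" (the number of unique values), and "range" alongside the summary statistics: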
          count      mean     std       min       25%       50%       75%       max  null  unique  range  dtype
age     30.0000    3.5000  1.7370    1.0000    2.0000    3.5000    5.0000    6.0000     0       6      5  int64
gender  30.0000    1.6667  0.4795    1.0000    1.0000    2.0000    2.0000    2.0000     0       2      1  int64
height  30.0000  155.5000  8.8034  141.0000  148.2500  155.5000  162.7500  170.0000     0      30     29  int64
The Python code that produces the result above is:
df1.describe().T.join(pd.DataFrame(df1.isnull().sum(), columns=['null']))\
.join(pd.DataFrame.from_dict({i:df1[i].nunique() for i in df1.columns}, orient='index')\
.rename(columns={0:'unique'}))\
.join(pd.DataFrame.from_dict({i:(df1[i].max() - df1[i].min()) for i in df1.columns}, orient='index')\
.rename(columns={0:'range'}))\
.join(pd.DataFrame(df1.dtypes, columns=['dtype']))
Thanks!
Answer 0 (score: 2)
This is very easy to do with the tidyr and dplyr libraries:
library("tidyr")
library("dplyr")
df1 <- data.frame(c(2,1,2), c(1,2,3,4,5,6), seq(141,170)) #create data.frame
names(df1) <- c('gender', 'age', 'height') #column names
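# reshape to long format: one row per (attribute, value) pair
df2 <- gather(df1, "attributes", "value")
# compute the summary statistics for each original column
df2 %>%
  group_by(attributes) %>%
  summarise(count = n(), mean = mean(value), med = median(value),
            sd = sd(value), min = min(value), max = max(value))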
# A tibble: 3 x 7
# attributes count mean med sd min max
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 age 30 3.500000 3.5 1.7370208 1 6
# 2 gender 30 1.666667 2.0 0.4794633 1 2
# 3 height 30 155.500000 155.5 8.8034084 141 170
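The output above leaves out the 25% and 75% quartiles that describe() reports. As a rough sketch of how the full describe().T layout could be built with base R only, assuming every column is numeric: the helper name describe_t is hypothetical and not part of any package, and the column names are given directly in data.frame() for brevity.

```r
# Sketch only: a describe().T-style summary for all-numeric data frames.
# `describe_t` is a hypothetical helper name, not a function from any package.
describe_t <- function(df) {
  stats <- sapply(df, function(col) {
    # names = FALSE keeps the quantile labels out of the result names
    q <- quantile(col, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
    c(count = sum(!is.na(col)),
      mean  = mean(col, na.rm = TRUE),
      std   = sd(col, na.rm = TRUE),
      min   = min(col, na.rm = TRUE),
      `25%` = q[1],
      `50%` = q[2],
      `75%` = q[3],
      max   = max(col, na.rm = TRUE))
  })
  # sapply yields one column per variable; transpose so variables become rows
  as.data.frame(t(stats))
}

df1 <- data.frame(gender = c(2, 1, 2), age = c(1, 2, 3, 4, 5, 6), height = seq(141, 170))
describe_t(df1)
```

R's default quantile method (type 7) uses linear interpolation, so the quartiles should agree with the pandas table in the question. Rows appear in the data frame's column order rather than sorted by name.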
Answer 1 (score: 1)
I usually use a small function (adapted from a script found online) for this kind of summary:
sumstats <- function(x) {
  # Per-column helpers. Non-numeric columns get the placeholder "N*N".
  null.k <- function(x) sum(is.na(x))              # number of missing values
  unique.k <- function(x) {                        # distinct values, not counting NA
    if (sum(is.na(x)) > 0) length(unique(x)) - 1
    else length(unique(x))
  }
  range.k <- function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE)
  mean.k <- function(x) {
    if (is.numeric(x)) round(mean(x, na.rm=TRUE), digits=2)
    else "N*N"
  }
  sd.k <- function(x) {
    if (is.numeric(x)) round(sd(x, na.rm=TRUE), digits=2)
    else "N*N"
  }
  # variance, so that the function matches the 'var' column in the output shown
  var.k <- function(x) {
    if (is.numeric(x)) round(var(x, na.rm=TRUE), digits=2)
    else "N*N"
  }
  min.k <- function(x) {
    if (is.numeric(x)) round(min(x, na.rm=TRUE), digits=2)
    else "N*N"
  }
  q05 <- function(x) quantile(x, probs=.05, na.rm=TRUE)
  q10 <- function(x) quantile(x, probs=.1, na.rm=TRUE)
  q25 <- function(x) quantile(x, probs=.25, na.rm=TRUE)
  q50 <- function(x) quantile(x, probs=.5, na.rm=TRUE)
  q75 <- function(x) quantile(x, probs=.75, na.rm=TRUE)
  q90 <- function(x) quantile(x, probs=.9, na.rm=TRUE)
  q95 <- function(x) quantile(x, probs=.95, na.rm=TRUE)
  max.k <- function(x) {
    if (is.numeric(x)) round(max(x, na.rm=TRUE), digits=2)
    else "N*N"
  }
  # one row per column of x, one column per statistic
  sumtable <- cbind(as.matrix(colSums(!is.na(x))), sapply(x, null.k), sapply(x, unique.k),
                    sapply(x, range.k), sapply(x, mean.k), sapply(x, sd.k), sapply(x, var.k),
                    sapply(x, min.k), sapply(x, q05), sapply(x, q10), sapply(x, q25), sapply(x, q50),
                    sapply(x, q75), sapply(x, q90), sapply(x, q95), sapply(x, max.k))
  sumtable <- as.data.frame(sumtable)
  names(sumtable) <- c('count', 'null', 'unique', 'range', 'mean', 'std', 'var', 'min',
                       '5%', '10%', '25%', '50%', '75%', '90%', '95%', 'max')
  return(sumtable)
}
sumstats(df1)
count null unique range mean std var min 5% 10% 25% 50% 75% 90% 95% max
gender 30.00 0.00 2.00 1.00 1.67 0.48 0.23 1.00 1.00 1.00 1.00 2.00 2.00 2.00 2.00 2.00
age 30.00 0.00 6.00 5.00 3.50 1.74 3.02 1.00 1.00 1.00 2.00 3.50 5.00 6.00 6.00 6.00
height 30.00 0.00 30.00 29.00 155.50 8.80 77.50 141.00 142.45 143.90 148.25 155.50 162.75 167.10 168.55 170.00
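You can easily adapt it to add more descriptive columns, such as further quantiles, null counts, ranges, and so on. It does return a data.frame. You may also want to make the handling of NAs configurable through a function argument.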
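As one possible way to make the NA handling configurable and to add a type column similar to the "dtype" asked for in the question, here is a reduced sketch. The function name sumstats_na and its na.rm parameter are hypothetical names chosen for illustration, and the code assumes every column is numeric.

```r
# Sketch only: `sumstats_na` and its `na.rm` argument are illustrative names.
# Assumes all columns of x are numeric.
sumstats_na <- function(x, na.rm = TRUE) {
  one_col <- function(col) {
    seen <- col[!is.na(col)]               # non-missing values only
    data.frame(
      count  = length(seen),               # always counts non-missing values
      null   = sum(is.na(col)),            # number of missing values
      unique = length(unique(seen)),       # distinct non-missing values
      range  = max(col, na.rm = na.rm) - min(col, na.rm = na.rm),
      mean   = mean(col, na.rm = na.rm),
      std    = sd(col, na.rm = na.rm),
      dtype  = class(col)[1]               # R class, e.g. "numeric" or "integer"
    )
  }
  # build one single-row data frame per column, then stack them
  do.call(rbind, lapply(x, one_col))
}

demo <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))
strict  <- sumstats_na(demo, na.rm = FALSE)  # NA propagates into range, mean, std
lenient <- sumstats_na(demo, na.rm = TRUE)   # missing values are dropped first
```

With na.rm = FALSE a column that contains a missing value reports NA for range, mean, and std, which makes incomplete columns easy to spot. The dtype column holds R class names, not pandas dtypes such as int64.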
Hope it helps.