我试图获取我的数据的描述性统计数据。我经历了很多建议。但是,我只是想知道是否有任何软件包可以对下面提供的数据格式执行描述性统计。
head(mydata)
X A1 A2 A3 M1 M2 M3 U1 U2 U3
1 A A A M M M U U U
2 X1 100 200 250 200 230 400 400 100 200
3 X2 600 300 400 300 550 750 800 900 540
4 X3 500 300 200 200 200 100 500 400 600
数据包含列上的样本和行上的变量。第一行是样本名称,第二行是组(A,M,U)。我想获得每个组的描述性统计数据。例如,对于组A(A1,A2,A3)的均值,sd ....谁能告诉我怎样才能做到这一点。我已经看到了描述性统计数据的大部分答案,而且它们用于列。 如果问题不明确,请告诉我。 谢谢你的帮助。
希格斯
答案 0 :(得分:2)
@Phil对他的推荐是正确的。
你在Hadley的书中学到的一个关键原则是整洁的数据原则(非常基本:列中的变量,行中的单个观察)。如果您想快速了解整洁的数据,请试试vignette。
有多种方法可以修复和分析您的数据,但这里有一个使用“tidyverse'”工具的示例。
# Load useful 'tidy data' packages
library(tidyverse)
# Make 'mydata'
mydata <- data_frame(X = c('', 'X1', 'X2', 'X3'),
A1 = c('A', 100, 600, 500),
A2 = c('A', 200, 300, 300),
A3 = c('A', 250, 400, 200),
M1 = c('M', 200, 300, 200),
M2 = c('M', 230, 550, 200),
M3 = c('M', 400, 750, 100),
U1 = c('U', 400, 800, 500),
U2 = c('U', 100, 900, 400),
U3 = c('U', 200, 540, 600))
# View 'mydata'
mydata
#> # A tibble: 4 x 10
#> X A1 A2 A3 M1 M2 M3 U1 U2 U3
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "" A A A M M M U U U
#> 2 X1 100 200 250 200 230 400 400 100 200
#> 3 X2 600 300 400 300 550 750 800 900 540
#> 4 X3 500 300 200 200 200 100 500 400 600
转换为整洁的数据框
# Transpose rows and columns and convert resulting matrix back into a dataframe
mydata_new <- as_data_frame(t(mydata))
# View 'mydata_new'
mydata_new
#> # A tibble: 10 x 4
#> V1 V2 V3 V4
#> <chr> <chr> <chr> <chr>
#> 1 "" X1 X2 X3
#> 2 A 100 600 500
#> 3 A 200 300 300
#> 4 A 250 400 200
#> 5 M 200 300 200
#> 6 M 230 550 200
#> 7 M 400 750 100
#> 8 U 400 800 500
#> 9 U 100 900 400
#> 10 U 200 540 600
# Clean 'mydata_new'
## Add column names
colnames(mydata_new) <- c('Group', 'X1', 'X2', 'X3')
## Remove first row
mydata_new <- mydata_new[-1, ]
# View cleaned 'mydata_new'
mydata_new
#> # A tibble: 9 x 4
#> Group X1 X2 X3
#> <chr> <chr> <chr> <chr>
#> 1 A 100 600 500
#> 2 A 200 300 300
#> 3 A 250 400 200
#> 4 M 200 300 200
#> 5 M 230 550 200
#> 6 M 400 750 100
#> 7 U 400 800 500
#> 8 U 100 900 400
#> 9 U 200 540 600
现在总结数据。
# Summarise numeric data
mydata_new %>%
# Convert all data columns from 'character' to 'numeric'
mutate_at(vars(starts_with('X')),
as.numeric) %>%
# Group data by the grouping variable before summarising
group_by(Group) %>%
# Calculate MEAN and SD for each data column
summarise_at(vars(starts_with('X')),
funs(MEAN = mean, SD = sd))
#> # A tibble: 3 x 7
#> Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 183. 433. 333. 76.4 153. 153.
#> 2 M 277. 533. 167. 108. 225. 57.7
#> 3 U 233. 747. 500 153. 186. 100
更新:2018年5月10日关于添加变异系数的查询。
变异系数不是基本R函数,因此创建用户定义的函数。
# Define function: (cv = sd / mean)
coef_var = function(x) {
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}
使用添加的摘要功能重新执行摘要
# Execute summary
mydata_new %>%
# Convert all data columns from 'character' to 'numeric'
mutate_at(vars(starts_with('X')),
as.numeric) %>%
# Group data by the grouping variable before summarising
group_by(Group) %>%
# Calculate summaries each data column
## Call the summary functions with a dummy "." argument so that
## Additional arguments can be added to the called functions
## (e.g., adding na.rm = TRUE to cope with missing data)
## See ?dplyr::funs for details
summarise_at(vars(starts_with('X')),
funs(MEAN = mean(., na.rm = TRUE), # Mean
SD = sd(., na.rm = TRUE), # SD
CV = coef_var, # Coefficient of variation
# Add other summary stats as needed
MEDIAN = median(., na.rm = TRUE), # Median
Q25 = quantile(., prob = 0.25, na.rm = TRUE), # 25th percentile
Q75 = quantile(., prob = 0.75, na.rm = TRUE), # 75th percentile
min = min(., na.rm = TRUE), # Minimum
max = max(., na.rm = TRUE))) # Maximum
#> # A tibble: 3 x 25
#> Group X1_MEAN X2_MEAN X3_MEAN X1_SD X2_SD X3_SD X1_CV X2_CV X3_CV
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 183. 433. 333. 76.4 153. 153. 0.417 0.353 0.458
#> 2 M 277. 533. 167. 108. 225. 57.7 0.390 0.423 0.346
#> 3 U 233. 747. 500 153. 186. 100 0.655 0.249 0.2
#> # ... with 15 more variables: X1_MEDIAN <dbl>, X2_MEDIAN <dbl>,
#> # X3_MEDIAN <dbl>, X1_Q25 <dbl>, X2_Q25 <dbl>, X3_Q25 <dbl>,
#> # X1_Q75 <dbl>, X2_Q75 <dbl>, X3_Q75 <dbl>, X1_min <dbl>, X2_min <dbl>,
#> # X3_min <dbl>, X1_max <dbl>, X2_max <dbl>, X3_max <dbl>
由reprex package(v0.2.0)创建于2018-05-10。