我想使用describe()函数创建数据框。正在考虑的数据集是虹膜。数据框应如下所示:
Variable n missing unique Info Mean 0.05 0.1 0.25 0.5 0.75 0.9 0.95
Sepal.Length 150 0 35 1 5.843 4.6 4.8 5.1 5.8 6.4 6.9 7.255
Sepal.Width 150 0 23 0.99 3.057 2.345 2.5 2.8 3 3.3 3.61 3.8
Petal.Length 150 0 43 1 3.758 1.3 1.4 1.6 4.35 5.1 5.8 6.1
Petal.Width 150 0 22 0.99 1.199 0.2 0.2 0.3 1.3 1.8 2.2 2.3
Species 150 0 3
有没有办法将describe()的输出强制转换为data.frame类型?当我试图强迫时,我收到一个错误,如下所示:
library(Hmisc)
statistics <- describe(iris)
statistics[1]
first_vec <- statistics[1]$Sepal.Length
as.data.frame(first_vec)
#Error in as.data.frame.default(first_vec) : cannot coerce class ""describe"" to a data.frame
由于
答案 0 :(得分:4)
解决这个问题的方法是使用str()
检查对象:
data(iris)
library(Hmisc)
di <- describe(iris)
di
# iris
#
# 5 Variables 150 Observations
# -------------------------------------------------------------
# Sepal.Length
# n missing unique Info Mean .05 .10 .25 .50 .75 .90 .95
# 150 0 35 1 5.843 4.600 4.800 5.100 5.800 6.400 6.900 7.255
#
# lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
# -------------------------------------------------------------
# ...
# -------------------------------------------------------------
# Species
# n missing unique
# 150 0 3
#
# setosa (50, 33%), versicolor (50, 33%)
# virginica (50, 33%)
# -------------------------------------------------------------
str(di)
# List of 5
# $ Sepal.Length:List of 6
# ..$ descript : chr "Sepal.Length"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named chr [1:12] "150" "0" "35" "1" ...
# .. ..- attr(*, "names")= chr [1:12] "n" "missing" "unique" "Info" ...
# ..$ intervalFreq:List of 2
# .. ..$ range: atomic [1:2] 4.3 7.9
# .. .. ..- attr(*, "Csingle")= logi TRUE
# .. ..$ count: int [1:100] 1 0 3 0 0 1 0 0 4 0 ...
# ..$ values : Named chr [1:10] "4.3" "4.4" "4.5" "4.6" ...
# .. ..- attr(*, "names")= chr [1:10] "L1" "L2" "L3" "L4" ...
# ..- attr(*, "class")= chr "describe"
# $ Sepal.Width :List of 6
# ...
# $ Species :List of 5
# ..$ descript: chr "Species"
# ..$ units : NULL
# ..$ format : NULL
# ..$ counts : Named num [1:3] 150 0 3
# .. ..- attr(*, "names")= chr [1:3] "n" "missing" "unique"
# ..$ values : num [1:2, 1:3] 50 33 50 33 50 33
# .. ..- attr(*, "dimnames")=List of 2
# .. .. ..$ : chr [1:2] "Frequency" "%"
# .. .. ..$ : chr [1:3] "setosa" "versicolor" "virginica"
# ..- attr(*, "class")= chr "describe"
# - attr(*, "descript")= chr "iris"
# - attr(*, "dimensions")= int [1:2] 150 5
# - attr(*, "class")= chr "describe"
我们看到di
是一个列表列表。我们可以通过查看第一个子列表来区分它们。您可以将其转换为矢量:
unlist(di[[1]])
# descript counts.n
# "Sepal.Length" "150"
# counts.missing counts.unique
# "0" "35"
# counts.Info counts.Mean
# "1" "5.843"
# counts..05 counts..10
# "4.600" "4.800"
# counts..25 counts..50
# "5.100" "5.800"
# counts..75 counts..90
# "6.400" "6.900"
# counts..95 intervalFreq.range1
# "7.255" "4.3"
# intervalFreq.range2 intervalFreq.count1
# "7.9" "1"
# ...
# values.H3 values.H2
# "7.6" "7.7"
# values.H1
# "7.9"
str(unlist(di[[1]]))
# Named chr [1:125] "Sepal.Length" "150" "0" "35" ...
# - attr(*, "names")= chr [1:125] "descript" "counts.n" "counts.missing" "counts.unique" ...
非常非常长(125)。这些元素被强制为具有相同(且最具包容性)的类型,即字符。看来你想要第2到第12个元素:
unlist(di[[1]])[2:12]
# counts.n counts.missing counts.unique counts.Info
# "150" "0" "35" "1"
# counts.Mean counts..05 counts..10 counts..25
# "5.843" "4.600" "4.800" "5.100"
# counts..50 counts..75 counts..90
# "5.800" "6.400" "6.900"
现在你有了一些可以开始使用的东西了。但请注意,这似乎只是数值变量的情况;因子变量species
是不同的:
unlist(di[[5]])
# descript counts.n counts.missing counts.unique
# "Species" "150" "0" "3"
# values1 values2 values3 values4
# "50" "33" "50" "33"
# values5 values6
# "50" "33"
在这种情况下,你似乎只想要第二到第四个元素。
使用此发现和解决问题的过程,您可以看到如何将describe
的输出分开并将所需信息放入数据框中。但是,这需要做很多工作。你可能需要使用循环和许多if(){ ... } else{ ... }
块。您可能只想从头开始编写自己的数据集描述函数。
答案 1 :(得分:3)
您可以使用stat.desc
包中的pastecs
功能执行此操作:
library(pastecs)
summary_df <- stat.desc(mydata)
summary_df
是您想要的数据框。查看更多信息here。
答案 2 :(得分:0)
在R中,您只需要使用summary(iris)
函数而不是Python中的describe(iris)
函数。