I want to create a subset of a data frame based on how subjects cluster in a biplot

Time: 2019-06-25 14:08:19

Tags: r pca unsupervised-learning biplot

This is one of the biplots I am working on. The circles mark the clusters from which I want to create subset data frames.


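If I am interested in the top cluster, how do I select the points that lie in the rectangle -.1

I can't share the data, but we can practice with the iris dataset.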


# iris ships with base R (datasets package), so no extra library is needed
biplot(prcomp(iris[,1:4]))

Suppose I am interested in the data in the rectangle -.125

How do I identify those points and create a subset from them?

1 answer:

Answer 0 (score: 0)

You can access the projected points with .$x and subset them on the PC1 and PC2 bounds of the rectangle.

Edit: for the clusters, I would use an algorithm (kmeans here) rather than hand-picked bounds. The code below does both in turn:

pc_res <- prcomp(iris[,1:4])
str(pc_res) # find that the data is stored in .$x
#> List of 5
#>  $ sdev    : num [1:4] 2.056 0.493 0.28 0.154
#>  $ rotation: num [1:4, 1:4] 0.3614 -0.0845 0.8567 0.3583 -0.6566 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#>  $ center  : Named num [1:4] 5.84 3.06 3.76 1.2
#>   ..- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
#>  $ scale   : logi FALSE
#>  $ x       : num [1:150, 1:4] -2.68 -2.71 -2.89 -2.75 -2.73 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
#>  - attr(*, "class")= chr "prcomp"
dframe <- as.data.frame(pc_res$x)
sub_res <- subset(x = dframe, subset = -.125 < dframe$PC1 &
                          dframe$PC1 <.75 &
                          -.15 < dframe$PC2 &
                          dframe$PC2 < 1.0)
head(sub_res)
#>             PC1       PC2         PC3          PC4
#> 54  0.183317720 0.8279590  0.17959139  0.093566840
#> 56  0.641669084 0.4182469 -0.04107609 -0.243116767
#> 60 -0.008745404 0.7230819 -0.28114143 -0.005618918
#> 62  0.511698557 0.1039812 -0.13054775  0.050719232
#> 63  0.264976508 0.5500365  0.69414683  0.057185519
#> 67  0.660283762 0.3529697 -0.32802753 -0.187878621
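
One caveat, as far as I understand `biplot.prcomp`: with its default `scale = 1` it does not draw the raw scores in `pc_res$x` but the scores divided by `sdev * sqrt(n)`, so bounds read off the biplot axes are in those scaled units. A minimal sketch, assuming that default. The bounds are hypothetical placeholders (not taken from the question), and the subset is taken from the original `iris` rows rather than from the scores:

```r
pc_res <- prcomp(iris[, 1:4])
n <- nrow(pc_res$x)

# Coordinates as drawn by biplot() under its default scale = 1 (assumption):
# each score column is divided by sdev * sqrt(n).
lam <- pc_res$sdev[1:2] * sqrt(n)
plotted <- sweep(pc_res$x[, 1:2], 2, lam, "/")

# Hypothetical rectangle, in the units of the biplot's axes.
in_rect <- plotted[, "PC1"] > -0.05 & plotted[, "PC1"] < 0.05 &
           plotted[, "PC2"] >  0.00 & plotted[, "PC2"] < 0.15

# The logical mask has one entry per observation, so it indexes iris directly.
iris_sub <- iris[in_rect, ]
head(iris_sub)
```

Indexing `iris` with the mask keeps the original measurements and the Species column, which is usually what "a subset of the data frame" means here.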

# if you want cluster from projection on (PC1, PC2)
dframe <- as.data.frame(prcomp(iris[,1:4])$x)
classif <- kmeans(x = dframe[,1:2], centers = 3, iter.max = 100, nstart = 10)
classif
#> K-means clustering with 3 clusters of sizes 61, 39, 50
#> 
#> Cluster means:
#>         PC1        PC2
#> 1  0.665676  0.3316042
#> 2  2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
#> 
#> Clustering vector:
#>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#>  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
#> [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
#> [141] 2 2 1 2 2 2 1 2 2 1
#> 
#> Within cluster sum of squares by cluster:
#> [1] 31.87959 18.87111 13.06924
#>  (between_SS / total_SS =  90.4 %)
#> 
#> Available components:
#> 
#> [1] "cluster"      "centers"      "totss"        "withinss"    
#> [5] "tot.withinss" "betweenss"    "size"         "iter"        
#> [9] "ifault"

# check visually your groups
str(classif)
#> List of 9
#>  $ cluster     : int [1:150] 3 3 3 3 3 3 3 3 3 3 ...
#>  $ centers     : num [1:3, 1:2] 0.666 2.347 -2.642 0.332 -0.274 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:3] "1" "2" "3"
#>   .. ..$ : chr [1:2] "PC1" "PC2"
#>  $ totss       : num 666
#>  $ withinss    : num [1:3] 31.9 18.9 13.1
#>  $ tot.withinss: num 63.8
#>  $ betweenss   : num 602
#>  $ size        : int [1:3] 61 39 50
#>  $ iter        : int 2
#>  $ ifault      : int 0
#>  - attr(*, "class")= chr "kmeans"
classif$centers
#>         PC1        PC2
#> 1  0.665676  0.3316042
#> 2  2.346527 -0.2739386
#> 3 -2.642415 -0.1908850
dframe$group <- classif$cluster
plot(x = dframe$PC1, y = dframe$PC2, col = dframe$group) # so you want group with minimal center
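
Note that the kmeans labels (1, 2, 3) are arbitrary and can change between runs, so a hard-coded group number may point at a different cluster after a re-run. A minimal sketch that picks the cluster by its centre instead and maps it back to the original rows. Treating "top cluster" as the one with the largest PC2 centre is my assumption, and `top` and `iris_top` are illustrative names:

```r
set.seed(42)  # kmeans uses random starts; fix the seed for reproducibility

scores <- as.data.frame(prcomp(iris[, 1:4])$x)
classif <- kmeans(x = scores[, 1:2], centers = 3, iter.max = 100, nstart = 10)

# Assumption: the "top" cluster is the one whose centre is highest on PC2.
top <- which.max(classif$centers[, "PC2"])

# classif$cluster has one label per row of iris, in the same order.
iris_top <- iris[classif$cluster == top, ]
table(iris_top$Species)
```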

result <- dframe[dframe$group == 1,] # or subset(x = dframe, subset = dframe$group == 1)
head(result)
#>           PC1         PC2         PC3           PC4 group
#> 52  0.9324885 -0.31833364  0.01801419  0.0005665121     1
#> 54  0.1833177  0.82795901  0.17959139  0.0935668402     1
#> 55  1.0881033 -0.07459068  0.30775790  0.1120205742     1
#> 56  0.6416691  0.41824687 -0.04107609 -0.2431167665     1
#> 57  1.0950607 -0.28346827 -0.16981024 -0.0835565724     1
#> 58 -0.7491227  1.00489096 -0.01230292 -0.0179077226     1

One last word: on choosing the optimal number of clusters, there is a very good graphical answer on SO: cluster-analysis-in-r-determine-the-optimal-number-of-clusters. Also, some packages let you use ggplot2 for this.
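
For the number of clusters, one common heuristic is the "elbow" plot of total within-cluster sum of squares against k. This is only a rough sketch in base R, not necessarily the method from the linked answer; the range 1 to 6 is an arbitrary choice:

```r
set.seed(1)  # kmeans uses random starts

scores <- prcomp(iris[, 1:4])$x[, 1:2]
ks <- 1:6

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(ks, function(k) {
  kmeans(scores, centers = k, iter.max = 100, nstart = 10)$tot.withinss
})

# Look for the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```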