Question

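I have a spatial data frame in R. We have a class imbalance problem, so I want to pull out the positive cases (our response variable is binary, and positives make up roughly 10% of the dataset) and then select a subset of the negative cases to counteract the class imbalance in the model. I want to select negative cases that are closely related spatially, and I am really struggling to figure out how.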

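Some ideas I've had that might work:

KNN to cluster the negative cases
Overlay a spatial grid and draw x samples from each grid square
Buffer analysis and random selection within the buffer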
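
As a rough illustration of the grid and buffer ideas, here is a minimal base-R sketch on simulated data. Everything in it is an assumption made for the example: the `farms` data frame, its `x`, `y` and `response` columns, the helper names `sample_by_grid` and `sample_by_buffer`, and the cell size and radius. It also assumes planar (projected) coordinates, so plain Euclidean distance is meaningful.

```r
# Simulated stand-in for the real data: 1,000 points with x/y coordinates
# and a binary response (~10% positive). All names here are hypothetical.
set.seed(42)
n <- 1000
farms <- data.frame(
  x = runif(n, 0, 100),
  y = runif(n, 0, 100),
  response = rbinom(n, 1, 0.1)
)
pos <- farms[farms$response == 1, ]
neg <- farms[farms$response == 0, ]

# Grid idea: overlay a regular grid and draw up to `per_cell` rows per cell
sample_by_grid <- function(df, cell_size, per_cell) {
  # Label each row with the grid cell its coordinates fall into
  cell <- paste(floor(df$x / cell_size), floor(df$y / cell_size), sep = "_")
  # Within each cell, keep a random subset of row indices
  idx <- unlist(lapply(split(seq_len(nrow(df)), cell), function(i) {
    i[sample.int(length(i), min(per_cell, length(i)))]
  }), use.names = FALSE)
  df[idx, ]
}

# Buffer idea: keep negatives within `radius` of any positive,
# then draw up to `size` of them at random
sample_by_buffer <- function(neg, pos, radius, size) {
  # Distance from each negative to its nearest positive
  nearest <- vapply(seq_len(nrow(neg)), function(i) {
    min(sqrt((pos$x - neg$x[i])^2 + (pos$y - neg$y[i])^2))
  }, numeric(1))
  inside <- which(nearest <= radius)
  neg[inside[sample.int(length(inside), min(size, length(inside)))], ]
}

neg_grid   <- sample_by_grid(neg, cell_size = 25, per_cell = 5)
neg_buffer <- sample_by_buffer(neg, pos, radius = 10, size = nrow(pos))
```

The grid version spreads the sampled negatives evenly over space, while the buffer version concentrates them near the positives. Which one suits the model depends on what "closely related spatially" should mean for the data at hand.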

If anyone has suggestions on how to do this in R, that would be great.

Thanks

Answer 1

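Just answering here in case anyone else searches for this.

I decided to use k-means clustering, then add the cluster as a column to the data frame and randomly sample from the clusters.

Code below!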

    ## Cluster analysis
    set.seed(1)
    clusdb <- W_neg[c(
      "x_coor_farm", "y_coor_farm",
      "Area_Farm_SqM", "NatGrass_1km_buff",
      "BioFor_1km_buff", "MixedFor_1km_buff",
      "Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436",
      "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436")]

    ## Write a function to loop the algorithm:
    ## fit k-means for a given k and return the total within-cluster sum of squares
    kmean_withinss <- function(k) {
      cluster <- kmeans(clusdb, k)
      return(cluster$tot.withinss)
    }

    # Set maximum number of clusters
    max_k <- 20
    # Run algorithm over a range of k
    wss <- sapply(2:max_k, kmean_withinss)

    # Data frame of kmeans output to find optimal K
    # (the unnamed first column is auto-named X2.max_k)
    elbow <- data.frame(2:max_k, wss)

    # Plot
    library(ggplot2)
    ggplot(elbow, aes(x = X2.max_k, y = wss)) +
      geom_point() +
      geom_line() +
      scale_x_continuous(breaks = seq(1, 20, by = 1))


    # Optimal K = 8
    # Re-run the model with optimal K

    pc_cluster_2 <- kmeans(clusdb, 8)
    pc_cluster_2$cluster
    pc_cluster_2$centers
    pc_cluster_2$size

    pc_cluster_2$totss
    pc_cluster_2$betweenss

    # Share of total variance explained by the clustering
    pc_cluster_2$betweenss / pc_cluster_2$totss * 100
    # 92%

    # Add cluster column to data frame
    W_neg$cluster <- pc_cluster_2$cluster


    # Keep the response, the clustering variables and the new cluster column
    W_neg <- W_neg[c(
      "TB2017",
      "x_coor_farm", "y_coor_farm",
      "Area_Farm_SqM", "NatGrass_1km_buff",
      "BioFor_1km_buff", "MixedFor_1km_buff",
      "Area_Cut_012", "Area_Cut_1224", "Area_Cut_2436",
      "Cut_Count_012", "Cut_Count_1224", "Cut_Count_2436",
      "cluster")]

    ggplot(data = W_neg, aes(y = cluster)) +
      geom_bar(aes(fill = TB2017)) +
      ggtitle("Count of Clusters by Region") +
      theme(plot.title = element_text(hjust = 0.5))

    # fviz_cluster() comes from the factoextra package
    # Note: the plot uses scale(clusdb), while kmeans() above was fit on the unscaled columns
    library(factoextra)
    fviz_cluster(pc_cluster_2, data = scale(clusdb), geom = c("point"), ellipse.type = "euclid")
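
The code above ends once the cluster column has been added and plotted. The last step described in the answer, randomly sampling from within each cluster, is not shown. Below is a minimal sketch of how that step could look, run on simulated data so it is self-contained. The `neg` data frame, the helper `sample_from_clusters` and the value of `n_per_cluster` are assumptions for illustration and are not taken from the original answer.

```r
# Simulated stand-in for W_neg after the cluster column has been added.
# Sizes and values are hypothetical; only two coordinate columns are used.
set.seed(1)
neg <- data.frame(
  x_coor_farm = runif(500, 0, 100),
  y_coor_farm = runif(500, 0, 100)
)
neg$cluster <- kmeans(neg[c("x_coor_farm", "y_coor_farm")], centers = 8)$cluster

# Draw up to `n_per_cluster` rows at random from each cluster
sample_from_clusters <- function(df, cluster_col, n_per_cluster) {
  # Row indices grouped by cluster label
  groups <- split(seq_len(nrow(df)), df[[cluster_col]])
  idx <- unlist(lapply(groups, function(i) {
    # sample.int avoids the surprise sample() gives for length-one vectors
    i[sample.int(length(i), min(n_per_cluster, length(i)))]
  }), use.names = FALSE)
  df[idx, ]
}

neg_sample <- sample_from_clusters(neg, "cluster", n_per_cluster = 10)
table(neg_sample$cluster)
```

Sampling a fixed number per cluster gives every cluster equal weight in the reduced set. If the original cluster proportions should be preserved instead, a fixed fraction per cluster could be drawn in the same way.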

Spatial clustering/sampling in R
