Binary representation of the Breast Cancer Wisconsin database

Date: 2018-04-09 18:37:46

Tags: r dataframe dplyr categorical-data

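I want to produce a binary representation of the well-known Breast Cancer Wisconsin database.

The initial dataset has 31 numeric variables and one categorical variable.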

 id_number diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean
1    842302         M       17.99        10.38         122.80    1001.0         0.11840          0.27760         0.3001             0.14710        0.2419
2    842517         M       20.57        17.77         132.90    1326.0         0.08474          0.07864         0.0869             0.07017        0.1812
3  84300903         M       19.69        21.25         130.00    1203.0         0.10960          0.15990         0.1974             0.12790        0.2069
4  84348301         M       11.42        20.38          77.58     386.1         0.14250          0.28390         0.2414             0.10520        0.2597
5  84358402         M       20.29        14.34         135.10    1297.0         0.10030          0.13280         0.1980             0.10430        0.1809

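I want to generate a binary representation of this data frame as follows:

Convert the diagnosis column (levels = M, B) into two columns, diagnosis_M and diagnosis_B, and put 1 or 0 in each row according to the value in the original column (M or B).

Find the median of each numeric column and split the column into two, depending on whether each value is greater or less than the mean. For example, split radius_mean into radius_mean_great, which holds 1 if the value is greater than the mean and 0 otherwise, and the mirror column radius_mean_low.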
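To make the target layout concrete, here is a minimal sketch on a made-up four-row data frame. The values in `toy` are invented for illustration and are not rows of the real dataset, and the mean is assumed as the cutoff (`median()` could be swapped in).

```r
# Toy data (made-up values, not taken from the dataset)
toy <- data.frame(
  diagnosis   = c("M", "B", "M", "B"),
  radius_mean = c(18.0, 11.4, 20.6, 12.0)
)

# One indicator column per diagnosis level
toy$diagnosis_M <- as.integer(toy$diagnosis == "M")
toy$diagnosis_B <- as.integer(toy$diagnosis == "B")

# Cutoff for the numeric column; assumed to be the mean here
cutoff <- mean(toy$radius_mean)

# Complementary indicators: exactly one of the two is 1 in every row
toy$radius_mean_great <- as.integer(toy$radius_mean >  cutoff)
toy$radius_mean_low   <- as.integer(toy$radius_mean <= cutoff)

print(toy)
```

Values equal to the cutoff land in the `_low` column in this sketch, so the two indicators never overlap.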

# RCurl provides getURL(); the other packages were loaded but never used
library("RCurl")

# Download the raw comma-separated file as a single string
UCI_data_URL <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data')

# The file has no header row, so the 32 column names are supplied by hand
names <- c('id_number', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave_points_worst', 'symmetry_worst', 'fractal_dimension_worst')

# Parse the downloaded text into a data frame
breast.cancer.fr <- read.table(textConnection(UCI_data_URL), sep = ',', col.names = names)

1 Answer:

Answer 0 (score: 0)

There are several ways to binarize the data. Here is one that I hope will help:

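# Keep only the 30 numeric feature columns (drop id_number and diagnosis)
df <- breast.cancer.fr[, 3:32]
# Two indicator columns per feature: "_great" (> mean) and "_low" (<= mean)
df2 <- matrix(NA, ncol = 2 * ncol(df), nrow = nrow(df))
for (i in seq_len(ncol(df))) {
  col_mean <- mean(df[, i])
  df2[, 2 * i - 1] <- as.numeric(df[, i] >  col_mean)
  df2[, 2 * i]     <- as.numeric(df[, i] <= col_mean)
}
# Interleave the names so each "_great" column is followed by its "_low" twin
colnames(df2) <- c(rbind(paste0(names(df), "_great"), paste0(names(df), "_low")))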
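The loop above uses the mean as the cutoff, while the question mentions both the median and the mean. As a hedged alternative, the sketch below wraps the same idea in a hypothetical helper, `binarize_columns()`, whose `threshold_fun` argument selects the cutoff. Both the helper name and the toy data frame are assumptions made for illustration.

```r
# Hypothetical helper: returns an integer matrix with two indicator
# columns ("<name>_great", "<name>_low") for every column of `df`.
binarize_columns <- function(df, threshold_fun = mean) {
  pieces <- lapply(names(df), function(col) {
    x <- df[[col]]
    cutoff <- threshold_fun(x, na.rm = TRUE)
    out <- cbind(as.integer(x > cutoff), as.integer(x <= cutoff))
    colnames(out) <- paste0(col, c("_great", "_low"))
    out
  })
  # Bind the per-column pieces side by side, keeping their column names
  do.call(cbind, pieces)
}

# Toy data (made-up values)
toy <- data.frame(a = c(1, 2, 3, 10), b = c(5, 5, 6, 7))

binarize_columns(toy)                          # mean as the cutoff
binarize_columns(toy, threshold_fun = median)  # median as the cutoff
```

On the real data this would be called as `binarize_columns(breast.cancer.fr[, 3:32])`. With a skewed column such as `a` in the toy data, the mean and the median give different splits, so the choice of cutoff matters.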

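library(dplyr)
# One-hot encode the diagnosis: one 0/1 indicator column per level
df3 <- select(breast.cancer.fr, id_number, diagnosis) %>%
  mutate(diagnosis_M = as.numeric(diagnosis == "M"),
         diagnosis_B = as.numeric(diagnosis == "B"))

# Drop the original diagnosis column and append the feature indicators
df <- cbind(df3[, -2], df2)
# First 10 rows and 7 columns of the result
df[1:10, 1:7]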
   id_number diagnosis_M diagnosis_B radius_mean_great radius_mean_low texture_mean_great texture_mean_low
1     842302           1           0                 1               0                  0                1
2     842517           1           0                 1               0                  0                1
3   84300903           1           0                 1               0                  1                0
4   84348301           1           0                 0               1                  1                0
5   84358402           1           0                 1               0                  0                1
6     843786           1           0                 0               1                  0                1
7     844359           1           0                 1               0                  1                0
8   84458202           1           0                 0               1                  1                0
9     844981           1           0                 0               1                  1                0
10  84501001           1           0                 0               1                  1                0
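A quick sanity check on output like this is that every indicator pair sums to 1 in each row. The sketch below shows one way to test that, using a made-up stand-in for the binarized data frame and a hypothetical helper named `is_complementary()`. Neither is part of the answer's code.

```r
# Toy stand-in for the binarized result (made-up values)
bin <- data.frame(
  id_number         = c(1, 2, 3),
  diagnosis_M       = c(1, 0, 1),
  diagnosis_B       = c(0, 1, 0),
  radius_mean_great = c(1, 0, 0),
  radius_mean_low   = c(0, 1, 1)
)

# Hypothetical helper: TRUE if the two named columns sum to 1 in every row
is_complementary <- function(data, col_a, col_b) {
  all(data[[col_a]] + data[[col_b]] == 1)
}

is_complementary(bin, "diagnosis_M", "diagnosis_B")
is_complementary(bin, "radius_mean_great", "radius_mean_low")
```

Applied to the real result, the same call could be run for each `_great`/`_low` pair. A `FALSE` would point to a missing value or an unexpected diagnosis level.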