Binning a character variable

Asked: 2016-03-25 17:02:45

Tags: r data-manipulation

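I have a class variable that is a concatenation of marital status, gender and age (for example MM32). I want to group these values so that the final output looks like this: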

Class ClassGrp
SM20  SM20-25
SM21  SM20-25
SM22  SM20-25
MF20  MF20-25
MF21  MF20-25
SF30  SF26-30
SF31  SF31-35

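I also have separate columns for age, gender and marital, so my first attempt was to bin the age with the cut function: cut(data$Class, breaks = 10). However, I can't work out how to convert the resulting groups into the 20-25 format.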

Edit

Input data

data <- structure(list(age = c(19L, 20L, 20L, 21L, 21L, 22L), gender = structure(c(2L, 
1L, 2L, 1L, 2L, 1L), .Label = c("Female", "Male"), class = "factor"), 
    marital = structure(c(3L, 3L, 3L, 3L, 3L, 2L), .Label = c("Divorced", 
    "Married", "Single", "Widowed"), class = "factor"), class = c("SM19", 
    "SF20", "SM20", "SF21", "SM21", "MF22"), ageGrp = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L), .Label = c("(18.9,25]", "(25,31]", "(31,37]", 
    "(37,43]", "(43,49]", "(49,55]", "(55,61]", "(61,67]", "(67,73]", 
    "(73,79.1]"), class = "factor")), .Names = c("age", "gender", 
"marital", "class", "ageGrp"), row.names = c(NA, 6L), class = "data.frame")

2 answers:

Answer 0 (score: 1)

# Read data
x <- read.table(file = "clipboard")

# Show the data I read in
x

# Bin the data as requested
x$ClassGrp <- as.character(x$ageGrp)
x$ClassGrp <- gsub("\\(", "", x$ClassGrp)
x$ClassGrp <- gsub("\\]", "", x$ClassGrp)
x$ClassGrp <- gsub(",", "-", x$ClassGrp)
# fixed = TRUE matches "18.9" literally ("." is otherwise a regex wildcard)
x$ClassGrp <- gsub("18.9", "19", x$ClassGrp, fixed = TRUE)
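# Gender initial: defaults to "M", recoded to "F" for females
x$g <- "M"
x$g[x$gender == "Female"] <- "F"

# Marital initial: defaults to "S"; only "Married" is recoded here
x$m <- "S"
x$m[x$marital == "Married"] <- "M"

# paste0() is vectorised, so no loop is needed; note the prefix is
# gender then marital status, the reverse of the class column
x$ClassGrp <- paste0(x$g, x$m, x$ClassGrp)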


x$g <- NULL
x$m <- NULL

# Show results
x

   age gender marital class    ageGrp ClassGrp
1   19   Male  Single  SM19 (18.9,25]  MS19-25
2   20 Female  Single  SF20 (18.9,25]  FS19-25
3   20   Male  Single  SM20 (18.9,25]  MS19-25
4   21 Female  Single  SF21 (18.9,25]  FS19-25
5   21   Male  Single  SM21 (18.9,25]  MS19-25
6   22 Female Married  MF22 (18.9,25]  FM19-25
7   22 Female  Single  SF22 (18.9,25]  FS19-25
8   22   Male Married  MM22 (18.9,25]  MM19-25
9   22   Male  Single  SM22 (18.9,25]  MS19-25
10  23 Female Married  MF23 (18.9,25]  FM19-25
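
As an alternative to cleaning up the interval labels with gsub, the labels can be passed to cut() directly. The following is only a minimal sketch, assuming integer ages; the `ages` data frame, the break points and the resulting groups (19-25, 26-30, 31-35) are hypothetical and not taken from the answer above. The prefix here follows the order of the class column (marital status, then gender).

```r
# Hypothetical example data (not the asker's full data set)
ages <- data.frame(
  age     = c(19L, 22L, 25L, 26L, 31L),
  gender  = c("Male", "Female", "Male", "Female", "Male"),
  marital = c("Single", "Married", "Single", "Single", "Married"),
  stringsAsFactors = FALSE
)

# Assumed bin edges; cut() uses right-closed intervals (lo, hi] by default
breaks <- c(18, 25, 30, 35)

# Build "19-25"-style labels: (lower edge + 1) up to the upper edge
labels <- paste0(head(breaks, -1) + 1, "-", tail(breaks, -1))

# Bin the ages and keep the labels as plain character strings
ages$ageGrp <- as.character(cut(ages$age, breaks = breaks, labels = labels))

# Prefix with the first letter of marital status and of gender
ages$ClassGrp <- paste0(substr(ages$marital, 1, 1),
                        substr(ages$gender, 1, 1),
                        ages$ageGrp)
ages
```

Ages outside the assumed breaks come back from cut() as NA, so real data would need break points that cover the full age range.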

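Answer 1 (score: 1)

You can define the output bins as a sorted vector of boundaries and find which bin contains each value (greater than or equal to one boundary and less than the next).

I also added a boundary check in case a value falls outside your bins (i.e. it is below the minimum boundary, or at or above the maximum).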

# Important: the bin boundaries must be sorted in ascending order!
my.bins <- c(20, 25, 30, 35, 40, 50, 65)

# Transform into bins
to.bin <- function(class) {
  gender.marital <- substring(class, 1, 2)
  age <- as.numeric(substring(class, 3))
  # Check the boundaries
  if (age < min(my.bins)) {
    return(paste0(gender.marital, "<", min(my.bins)))
  } else if (age >= max(my.bins)) {
    return(paste0(gender.marital, ">=", max(my.bins)))
  }
  lower <- which(my.bins > age)[1]
  return(paste0(gender.marital, my.bins[lower - 1], "-", my.bins[lower] - 1))
}

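# vapply() guarantees one character result per row and drops the names
data$ClassGrp <- vapply(data$class, to.bin, character(1), USE.NAMES = FALSE)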
data

With your data, the code returns:

  age gender marital class    ageGrp ClassGrp
1  19   Male  Single  SM19 (18.9,25]    SM<20
2  20 Female  Single  SF20 (18.9,25]  SF20-24
3  20   Male  Single  SM20 (18.9,25]  SM20-24
4  21 Female  Single  SF21 (18.9,25]  SF20-24
5  21   Male  Single  SM21 (18.9,25]  SM20-24
6  22 Female Married  MF22 (18.9,25]  MF20-24
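
The same lookup can also be done for the whole column at once with base R's findInterval(), which returns the index of the bin each age falls into. This is a sketch rather than part of the answer above: `to_bin_vec` and `bins` are hypothetical names, and the boundaries simply copy my.bins.

```r
# Bin boundaries, sorted ascending (same values as my.bins in the answer)
bins <- c(20, 25, 30, 35, 40, 50, 65)

# Hypothetical helper: vectorised version of the bin lookup
to_bin_vec <- function(class, bins) {
  prefix <- substring(class, 1, 2)            # marital + gender letters
  age    <- as.numeric(substring(class, 3))   # remaining characters are the age

  # findInterval() returns 0 below the first boundary and
  # length(bins) at or above the last one
  idx <- findInterval(age, bins)

  below  <- idx == 0
  above  <- idx == length(bins)
  inside <- !below & !above

  out <- character(length(class))
  out[below]  <- paste0(prefix[below], "<", bins[1])
  out[above]  <- paste0(prefix[above], ">=", bins[length(bins)])
  out[inside] <- paste0(prefix[inside], bins[idx[inside]], "-",
                        bins[idx[inside] + 1] - 1)
  out
}

to_bin_vec(c("SM19", "SF20", "MF22", "MM24", "SF25", "MM70"), bins)
# "SM<20" "SF20-24" "MF20-24" "MM20-24" "SF25-29" "MM>=65"
```

Because the whole vector is processed in one call, this could be used as data$ClassGrp <- to_bin_vec(data$class, bins) without sapply.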