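Conditionally creating a factor as a data frame column

Time: 2014-06-11 11:53:04

Tags: r dataframe vectorization

I have a simple data frame containing information about open source software releases, as shown below: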

> head(a, n=50)
   Project ID         Latest Release
1          14      dhiggen_merge-5.0
2          11                  r2-00
3           2              Snapshots
4          70                   1.90
5          72                    2.5
6          30   AfterStep 2.00.beta5
7          38                    1.0
8           7            gedit 0.9.5
9          92                   1.0b
10         93             2001-11-19
11         68                 1.9.97
12         15                3.0-RC8
13         47                3.23.52
14          3                    7.5
15         12                  0.9.7
16         19                 2.0.5a
17         31 wm-session-hacks-0.1.0
18         75               1.16r6.1
19         16             udb-1.8-29
20         21                    0.1
21         64                  0.6.2
22         34                  0.3.1
23         35
24         99                  2.0.8
25         44                1.2.6.1
26         22                 0.94.3
27         32                  1.5.0
28         78                   .92q

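I wrote the following transformation function to create a new data frame column of class factor that indicates the software's maturity, based on very simple conditions: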

prjMaturity <- function (indicator, data) {

  var <- data[["Latest Release"]]

  rx <- "^(.*-)?([[:digit:]]+\\.)?([[:digit:]]+\\.)?(\\*|[[:digit:]]+)$"
  major <- gsub(rx, "\\2", var)
  major <- substr(major, 1, nchar(major)-1)
  major <- as.numeric(major)

  if (major > 0 && major < 1)   maturity <- "Alpha/Beta"
  if (major >= 1 && major <= 2) matirity <- "Stable"
  if (major > 2)                maturity <- "Mature"

  data["Project Maturity"] <- as.factor(maturity)

  if (DEBUG2) {print(summary(data)); message("")}

  return (data)
}

However, running this code produces unexpected, incorrect results along with warnings:

  Project ID        Latest Release     Project Maturity
 Length:28          Length:28          Mature:28
 Class1:avector     Class1:avector
 Class2:avector     Class2:avector
 Class3:character   Class3:character
 Mode  :character   Mode  :character

Warning messages:
1: In (function (indicator, data)  : NAs introduced by coercion
2: In if (major > 2) maturity <- "Mature" :
  the condition has length > 1 and only the first element will be used

What am I doing wrong, or what am I missing? Thanks!

2 answers:

Answer 0: (score: 1)

You can use cut() (see ?cut):

major
[1] 5.00   NA   NA 1.00 2.00   NA 1.00   NA 1.00   NA 1.00   NA 3.00 7.00 0.00
[16]   NA 0.00   NA   NA 0.00 0.00 0.00   NA 2.00   NA 0.00 1.00 0.92

cut(major, breaks=c(0+0.01,1-0.01,2,Inf),include.lowest=TRUE,labels=c("Alpha/Beta","Stable","Mature"))
[1] Mature     <NA>       <NA>       Stable     Stable     <NA>      
[7] Stable     <NA>       Stable     <NA>       Stable     <NA>      
[13] Mature     Mature     <NA>       <NA>       <NA>       <NA>      
[19] <NA>       <NA>       <NA>       <NA>       <NA>       Stable    
[25] <NA>       <NA>       Stable     Alpha/Beta
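
The second warning in the question arises because if () looks only at the first element of a vector condition. As a minimal sketch of a vectorized alternative that keeps the question's original conditions (0 < major < 1, 1 <= major <= 2, major > 2), nested ifelse() calls can be used. The major vector here is a small hypothetical sample, not the full data:

```r
# Hypothetical sample of already-extracted major versions (NA = unparseable)
major <- c(5, NA, 1, 2, 0.92, 0, 3)

# ifelse() evaluates the condition for every element, unlike if (),
# which expects a single TRUE/FALSE value. Note the elementwise `&`
# instead of the scalar `&&`.
maturity <- ifelse(major > 0 & major < 1, "Alpha/Beta",
            ifelse(major >= 1 & major <= 2, "Stable",
            ifelse(major > 2, "Mature", NA_character_)))

# Convert to a factor with an explicit level order
maturity <- factor(maturity, levels = c("Alpha/Beta", "Stable", "Mature"))
print(maturity)
```

Values that match none of the conditions (here 0 and NA) end up as NA, which is also what cut() does for values outside its breaks.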

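Answer 1: (score: 0)

I originally posted this as an update to the question, then decided to post it as an answer, given the changes I made.

I improved my code, including the regular expression, so that it handles the variety of existing data: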

prjMaturity <- function (indicator, data) {

  # do not process, if target column (type) already exists
  if (is.factor(data[["Project Maturity"]])) {
    message("Project Maturity: ", appendLF = FALSE)
    message("Not processing - Transformation already performed!\n")
    # return the unchanged data so that `x <- prjMaturity(ind, x)` does not set x to NULL
    return (invisible(data))
  }

  var <- data[["Latest Release"]]

  rx <- "^([^[:digit:]]*)([[:digit:]]+)(\\.|-)+(.*)$"
  major <- gsub(rx, "\\2", var)
  major <- as.numeric(major)

  data[["Project Maturity"]] <- 
    cut(major, breaks = c(0, 1, 2, Inf), include.lowest = TRUE,
        right = FALSE, labels=c("Alpha/Beta", "Stable", "Mature"))

  if (DEBUG2) {print(summary(data)); message("")}

  return (data)
}
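
To illustrate what the regular expression and the cut() call above do, here is a small self-contained sketch. The releases data frame is a hypothetical five-row excerpt of the question's data, and the steps are run directly rather than through prjMaturity(), which depends on the global DEBUG2:

```r
# Hypothetical excerpt of the data from the question
releases <- data.frame(
  `Project ID`     = c(14, 70, 7, 2, 15),
  `Latest Release` = c("dhiggen_merge-5.0", "1.90", "gedit 0.9.5",
                       "Snapshots", "3.0-RC8"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Optional non-digit prefix, then the major version (group 2),
# then at least one "." or "-" separator, then the rest
rx <- "^([^[:digit:]]*)([[:digit:]]+)(\\.|-)+(.*)$"

# Strings that do not match are returned unchanged by sub(), so
# as.numeric() turns them into NA; the coercion warning is suppressed here
major <- suppressWarnings(
  as.numeric(sub(rx, "\\2", releases[["Latest Release"]]))
)

# right = FALSE gives the intervals [0,1), [1,2) and [2,Inf]
releases[["Project Maturity"]] <-
  cut(major, breaks = c(0, 1, 2, Inf), include.lowest = TRUE,
      right = FALSE, labels = c("Alpha/Beta", "Stable", "Mature"))

print(releases)
```

Note that with right = FALSE a major version of exactly 2 falls into "Mature", whereas the conditions in the question place it in "Stable".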