在数据框列中组合因子级别

时间:2014-06-07 14:40:37

标签: r taxonomy categorical-data merging-data

我有一个数据框data,其中有一个名为“Project License”的列,它代表一个分类变量,因此,在R术语中,是 factor 。我正在尝试创建一个新列,其中开源软件许可证按照分类组合成更大的类别。但是,当我尝试组合(合并)该级别的时,我最终会得到一个列,其中所有级别都会丢失,或者保持不变,或者出现错误消息,例如下面的一个:

  

因子出错(数据[[“项目许可证”]],等级=分类,   labels = c(“Highly Restrictive”,:       无效的'标签';长度4应为1或6

以下是此功能的代码(从函数中提取):

myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')

licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)

data[["Project License"]] <- licenses

classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

restrictiveness <- 
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

data[["License Restrictiveness"]] <- restrictiveness

我还尝试了一些其他方法(包括“R Inferno”第8.2.5节中描述的方法),但到目前为止还没有成功。

我做错了什么以及如何解决这个问题?谢谢!

更新(数据):

> head(data, n=20)
   Project ID Project License
1       45556            lgpl
2       41636             bsd
3       95627             gpl
4       66930             gpl
5       51103             gpl
6       65637             gpl
7       41834             gpl
8       70998             gpl
9       95064             gpl
10      48810            lgpl
11      95934             gpl
12      90909             gpl
13       6538         website
14      16439             gpl
15      41924             gpl
16      78987             gpl
17      58662            zlib
18       1904             bsd
19      93838          public
20      90047            lgpl

> str(data)
'data.frame':   45033 obs. of  2 variables:
 $ Project ID     : chr  "45556" "41636" "95627" "66930" ...
 $ Project License: chr  "lgpl" "bsd" "gpl" "gpl" ...
 - attr(*, "SQL")=Class 'base64'  chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
 - attr(*, "indicatorName")=Class 'base64'  chr "cHJqTGljZW5zZQ=="
 - attr(*, "resultNames")=Class 'base64'  chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"

更新2(数据):

> unique(data[["Project License"]])
 [1] "lgpl"       "bsd"        "gpl"        "website"    "zlib"
 [6] "public"     "other"      "ibmcpl"     "rpl"        "mpl11"
[11] "mit"        "afl"        "python"     "mpl"        "apache"
[16] "osl"        "w3c"        "iosl"       "artistic"   "apsl"
[21] "ibm"        "plan9"      "php"        "qpl"        "psfl"
[26] "ncsa"       "rscpl"      "sunpublic"  "zope"       "eiffel"
[31] "nethack"    "sissl"      "none"       "opengroup"  "sleepycat"
[36] "nokia"      "attribut"   "xnet"       "eiffel2"    "wxwindows"
[41] "motosoto"   "vovida"     "jabber"     "cvw"        "historical"
[46] "nausite"    "real"

2 个答案:

答案 0 :(得分:3)

问题在于级别数不等于因子创建中的标签数量,也不是长度为1。

来自?factor

labels  
  either an optional character vector of labels for the levels (in the same order as
  levels after removing those in exclude), or a character string of length 1.

你需要让这些达成一致。 classification中的名称并不是factor组合标签的提示。

例如:

factor(..., levels=classification, labels=c('Highly Restrictive',
                                            'Restrictive.1',
                                            'Restrictive.2',
                                            'Permissive.1',
                                            'Permissive.2',
                                            'Unknown'))

要将因子映射到具有较少级别的另一个因子,可以按名称索引向量。将classification向量作为查找转换:

 classification <- c(gpl='Highly Restrictive',
                     lgpl='Restrictive', 
                     public='Restrictive',
                     bsd='Permissive',
                     artistic='Permissive',
                     other='Unknown')

将其用作查找表:

data[["License Restrictiveness"]] <- 
    as.factor(classification[as.character(data[['Project License']])])

head(data)
##   Project ID Project License License Restrictiveness
## 1      45556            lgpl             Restrictive
## 2      41636             bsd              Permissive
## 3      95627             gpl      Highly Restrictive
## 4      66930             gpl      Highly Restrictive
## 5      51103             gpl      Highly Restrictive
## 6      65637             gpl      Highly Restrictive

答案 1 :(得分:1)

如果您先转换为角色,例如(未经测试)

,也许您的任务会变得更容易
license.map <- c(lgpl="Permissive", bsd="Permissive", 
                 gpl="Restrictive", website="Unkown") # etc.
dat <- transform(dat, LicenseType=license.map[Project.License])

由于默认stringsAsFactor为True,因此新列是一个因素。