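I have a data frame data with a column named "Project License", which represents a categorical variable and is therefore, in R terms, a factor. I am trying to create a new column in which the open source software licenses are combined into larger categories according to a classification. However, when I try to combine (merge) the factor's levels, I end up with a column where all levels are lost, or are left unchanged, or I get an error message such as the one below:
Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive", :
  invalid 'labels'; length 4 should be 1 or 6
Here is the code for this functionality (extracted from a function):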
myLevels <- c('gpl', 'lgpl', 'bsd',
'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
'Other', 'Artistic', 'Public')
licenses <- factor(data[["Project License"]],
levels = myLevels, labels = myLabels)
data[["Project License"]] <- licenses
classification <- c(highly = c('gpl'),
restrictive = c('lgpl', 'public'),
permissive = c('bsd', 'artistic'),
unknown = c('other'))
restrictiveness <-
factor(data[["Project License"]],
levels = classification,
labels = c('Highly Restrictive', 'Restrictive',
'Permissive', 'Unknown'))
data[["License Restrictiveness"]] <- restrictiveness
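I have also tried several other approaches (including the one described in section 8.2.5 of "The R Inferno"), but without success so far.
What am I doing wrong, and how can I fix it? Thanks!
Update (data):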
> head(data, n=20)
Project ID Project License
1 45556 lgpl
2 41636 bsd
3 95627 gpl
4 66930 gpl
5 51103 gpl
6 65637 gpl
7 41834 gpl
8 70998 gpl
9 95064 gpl
10 48810 lgpl
11 95934 gpl
12 90909 gpl
13 6538 website
14 16439 gpl
15 41924 gpl
16 78987 gpl
17 58662 zlib
18 1904 bsd
19 93838 public
20 90047 lgpl
> str(data)
'data.frame': 45033 obs. of 2 variables:
$ Project ID : chr "45556" "41636" "95627" "66930" ...
$ Project License: chr "lgpl" "bsd" "gpl" "gpl" ...
- attr(*, "SQL")=Class 'base64' chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
- attr(*, "indicatorName")=Class 'base64' chr "cHJqTGljZW5zZQ=="
- attr(*, "resultNames")=Class 'base64' chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"
Update 2 (data):
> unique(data[["Project License"]])
[1] "lgpl" "bsd" "gpl" "website" "zlib"
[6] "public" "other" "ibmcpl" "rpl" "mpl11"
[11] "mit" "afl" "python" "mpl" "apache"
[16] "osl" "w3c" "iosl" "artistic" "apsl"
[21] "ibm" "plan9" "php" "qpl" "psfl"
[26] "ncsa" "rscpl" "sunpublic" "zope" "eiffel"
[31] "nethack" "sissl" "none" "opengroup" "sleepycat"
[36] "nokia" "attribut" "xnet" "eiffel2" "wxwindows"
[41] "motosoto" "vovida" "jabber" "cvw" "historical"
[46] "nausite" "real"
Answer 0 (score: 3)
The problem is that the number of labels passed to factor is neither equal to the number of levels nor equal to 1.
From ?factor:
labels
either an optional character vector of labels for the levels (in the same order as
levels after removing those in exclude), or a character string of length 1.
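You need to make these agree. The names in classification are not a hint to factor to combine labels.
For example, this call supplies six labels for the six levels: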
factor(..., levels=classification, labels=c('Highly Restrictive',
'Restrictive.1',
'Restrictive.2',
'Permissive.1',
'Permissive.2',
'Unknown'))
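To see where the "length 4 should be 1 or 6" comes from, the following minimal sketch inspects the classification vector from the question. The licenses vector is a made-up stand-in for data[["Project License"]], not the real data.

```r
# The nested c() calls are flattened into one named character vector.
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))
n_levels <- length(classification)   # six values, not four groups
flat_names <- names(classification)  # multi-element groups get numeric suffixes

# Toy stand-in for data[["Project License"]] (assumed values).
licenses <- c('lgpl', 'bsd', 'gpl', 'other')

# Four labels for six levels: factor() signals an error, captured here.
result <- tryCatch(
  factor(licenses, levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown')),
  error = function(e) e)

print(n_levels)
print(flat_names)
print(inherits(result, "error"))
```

Because c() discards the grouping, factor only ever sees six separate levels, so it cannot know that 'lgpl' and 'public' were meant to share a label.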
To map a factor to another factor with fewer levels, you can index a vector by name. Turn the classification vector into a lookup:
classification <- c(gpl='Highly Restrictive',
lgpl='Restrictive',
public='Restrictive',
bsd='Permissive',
artistic='Permissive',
other='Unknown')
Then use it as a lookup table:
data[["License Restrictiveness"]] <-
as.factor(classification[as.character(data[['Project License']])])
head(data)
## Project ID Project License License Restrictiveness
## 1 45556 lgpl Restrictive
## 2 41636 bsd Permissive
## 3 95627 gpl Highly Restrictive
## 4 66930 gpl Highly Restrictive
## 5 51103 gpl Highly Restrictive
## 6 65637 gpl Highly Restrictive
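The question's data contains many licenses (for example 'website' and 'zlib') that have no entry in the lookup vector, and indexing by a missing name yields NA. The sketch below shows one possible way to handle that, under the assumption that unlisted licenses should fall into 'Unknown'. The small data frame is a toy stand-in for the real one.

```r
# Toy stand-in for the real data frame (assumed values).
data <- data.frame(`Project License` = c('lgpl', 'bsd', 'gpl',
                                         'website', 'zlib', 'public'),
                   check.names = FALSE, stringsAsFactors = FALSE)

classification <- c(gpl = 'Highly Restrictive',
                    lgpl = 'Restrictive',
                    public = 'Restrictive',
                    bsd = 'Permissive',
                    artistic = 'Permissive',
                    other = 'Unknown')

# Look up each license by name; names missing from the lookup give NA.
mapped <- unname(classification[as.character(data[['Project License']])])

# Assumption: anything not listed in the lookup counts as 'Unknown'.
mapped[is.na(mapped)] <- 'Unknown'

# Fix the level order explicitly instead of relying on alphabetical order.
data[['License Restrictiveness']] <-
  factor(mapped, levels = c('Highly Restrictive', 'Restrictive',
                            'Permissive', 'Unknown'))
print(data)
```

If unlisted licenses should stay missing instead, drop the is.na() replacement and the new column will hold NA for them.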
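Answer 1 (score: 1)
Maybe your task gets easier if you convert to character first, e.g. (untested):
license.map <- c(lgpl="Permissive", bsd="Permissive",
gpl="Restrictive", website="Unknown") # etc.
dat <- transform(dat, LicenseType=license.map[Project.License])
Since stringsAsFactors defaults to TRUE (in R versions before 4.0.0; from 4.0.0 on the default is FALSE), the new column is a factor.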
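A self-contained sketch of the same idea, which does not depend on the stringsAsFactors default because it calls factor() explicitly. The data frame dat and its Project.License column are toy assumptions matching the names used in the snippet above.

```r
# Toy data frame; Project.License is kept as character (assumed values).
dat <- data.frame(Project.License = c('lgpl', 'bsd', 'gpl', 'website'),
                  stringsAsFactors = FALSE)

license.map <- c(lgpl = "Permissive", bsd = "Permissive",
                 gpl = "Restrictive", website = "Unknown")

# Wrap the lookup in factor() so the result is a factor in any R version.
dat <- transform(dat,
                 LicenseType = factor(unname(license.map[Project.License])))
str(dat)
```

Calling factor() directly keeps the result the same across R versions, at the cost of one extra function call.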