Converting factor variables with character labels to numeric

Time: 2015-10-03 17:29:25

Tags: r

I am trying to convert factor variables to numeric. I tried these two solutions:

as.numeric(levels(f))[f] 

as.numeric(as.character(f))

But the problem persists. Warning message: NAs introduced by coercion

Reproducible example:

df = data.frame(x = c("10: Already Delinquent 90+",
                      "11: Credit History <6 Months",
                      "12: Current Balance = 0",
                      "13: Balance (2-6)=0",
                      "20: 1+ x 90+",
                      "30: 3+ x 60-89",
                      "31: 2 x 60-89",
                      "32: 1 x 60-89",
                      "40: 3+ x 30-59",
                      "41: 2 x 30-59",
                      "42: 1 x 30-59",
                      "50: Insufficient Performance",
                      "60: 3+ x 1-29",
                      "61: 2 x 1-29",
                      "62: 1 x 1-29",
                      "70: Never delinquent"),
                y = c("00:Bad",
                      "01:Ind",
                      "02:Good",
                      "NA",
                      "00:Bad",
                      "01:Ind",
                      "02:Good",
                      "NA",
                      "00:Bad",
                      "01:Ind",
                      "02:Good",
                      "NA",
                      "00:Bad",
                      "01:Ind",
                      "02:Good",
                      "NA"),
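                z = ceiling(rnorm(16)),
                stringsAsFactors = TRUE)  # explicit, since R >= 4.0.0 no longer converts strings to factors by default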

#Select all the factor variables
factorvars = colnames(df)[which(sapply(df,is.factor))]

#Concatenate with "_Num"
xxx <- paste(factorvars, "_Num", sep="")

#Converting Factor to Numeric
for (i in seq_along(factorvars)) {
  df[, xxx[i]] = NA
  # This conversion triggers the warning: the level labels are not numbers
  df[, xxx[i]] = as.numeric(levels(df[, factorvars[i]])[df[, factorvars[i]]])
}

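I want to keep the factor variables and create new variables that hold the levels converted to numbers. The desired output looks like this: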

x                             y        x_num  y_num
10: Already Delinquent 90+    00:Bad   1      1
11: Credit History <6 Months  01:Ind   2      2
12: Current Balance = 0       02:Good  3      3
13: Balance (2-6)=0           NA       4      NA
20: 1+ x 90+                  00:Bad   5      1
30: 3+ x 60-89                01:Ind   6      2
31: 2 x 60-89                 02:Good  7      3
32: 1 x 60-89                 NA       8      NA
40: 3+ x 30-59                00:Bad   9      1
41: 2 x 30-59                 01:Ind   10     2
42: 1 x 30-59                 02:Good  11     3
50: Insufficient Performance  NA       12     NA
60: 3+ x 1-29                 00:Bad   13     1
61: 2 x 1-29                  01:Ind   14     2
62: 1 x 1-29                  02:Good  15     3
70: Never delinquent          NA       16     NA

1 answer:

Answer 0 (score: 2)

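Judging by your desired output, it looks like you do not want to convert the factors to the numbers contained in their strings. Instead, you want the internal integer representation of the factors, that is, the position of each value in levels().

Try this: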

df[,xxx] <- lapply(df[,factorvars], as.numeric)
#                               x       y  z x_Num y_Num
# 1    10: Already Delinquent 90+  00:Bad  2     1     1
# 2  11: Credit History <6 Months  01:Ind  2     2     2
# 3       12: Current Balance = 0 02:Good  1     3     3
# 4           13: Balance (2-6)=0    <NA>  1     4    NA
# 5                  20: 1+ x 90+  00:Bad  0     5     1
# 6                30: 3+ x 60-89  01:Ind  0     6     2
# 7                 31: 2 x 60-89 02:Good  0     7     3
# 8                 32: 1 x 60-89    <NA>  0     8    NA
# 9                40: 3+ x 30-59  00:Bad  2     9     1
# 10                41: 2 x 30-59  01:Ind  0    10     2
# 11                42: 1 x 30-59 02:Good  0    11     3
# 12 50: Insufficient Performance    <NA>  1    12    NA
# 13                60: 3+ x 1-29  00:Bad  1    13     1
# 14                 61: 2 x 1-29  01:Ind -1    14     2
# 15                 62: 1 x 1-29 02:Good -1    15     3
# 16         70: Never delinquent    <NA> -1    16    NA
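
To see why the two idioms from the question return NA while plain as.numeric() works, here is a minimal sketch on toy factors. The names `f` and `g` and their values are made up for illustration and are not part of the question's data.

```r
# Toy factor whose labels are text, not pure numbers (hypothetical data)
f <- factor(c("00:Bad", "01:Ind", "02:Good", "00:Bad"))

# Internal integer codes: the position of each value in levels(f)
codes <- as.numeric(f)

# Parsing the labels as numbers fails, because "00:Bad" is not a number;
# without suppressWarnings() this emits "NAs introduced by coercion"
parsed <- suppressWarnings(as.numeric(as.character(f)))

# The labels-as-numbers idiom is only appropriate when labels really are numeric
g <- factor(c("10", "20", "10"))
g_values <- as.numeric(levels(g))[g]  # the numbers written in the labels
g_codes  <- as.numeric(g)             # the internal codes, which differ

print(codes)     # 1 2 3 1
print(parsed)    # NA NA NA NA
print(g_values)  # 10 20 10
print(g_codes)   # 1 2 1
```

So as.numeric(levels(f))[f] and as.numeric(as.character(f)) answer the question "which number is written in the label?", while as.numeric(f) answers "which level is this?". The desired output in the question corresponds to the second.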

Data

I cleaned up your example data by changing the string "NA" to actual NA values:

is.na(df$y) <- df$y == "NA"
df$y <- droplevels(df$y)
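
As a rough illustration of what those two lines do, the sketch below applies them to a small stand-alone factor. The vector `y` and the helper names `before` and `after` are hypothetical and exist only for this example.

```r
# Hypothetical toy column where missing values were typed as the string "NA"
y <- factor(c("00:Bad", "01:Ind", "02:Good", "NA"))

# At this point "NA" is an ordinary level, so it gets its own integer code
before <- as.numeric(y)

# Mark the matching elements as truly missing ...
is.na(y) <- y == "NA"
# ... and drop the "NA" level, which is now unused
y <- droplevels(y)

after <- as.numeric(y)

print(before)     # 1 2 3 4
print(after)      # 1 2 3 NA
print(levels(y))  # "00:Bad" "01:Ind" "02:Good"
```

Without this cleanup, the string "NA" would be coded as a fourth level instead of appearing as a missing value in the numeric column.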