I have already gone through various links, for example: How to convert a factor to an integer\numeric without a loss of information?
But I could not solve the problem.
I have a data frame:
SYMBOL PVALUE1 PVALUE2
1 10-Mar 0.813027629406118 0.78820189558684
2 10-Sep 0.00167287722066533 0.00167287722066533
3 11-Mar 0.21179810441316 0.464576340307205
4 11-Sep 0.00221961024320294 0.00221961024320294
5 12-Sep 0.934667427815304 0.986884425214009
6 15-Sep 0.00167287722066533 0.00167287722066533
7 1-Dec 0.464576340307205 0.0911572830792113
8 1-Mar 0.00818426308604705 0.0252302356363697
9 1-Sep 0.60516237199519 0.570568468332992
10 2-Mar 0.0103975819620539 0.00382292568622066
11 2-Sep 0.00167287722066533 0.00167287722066533
When I try str():
str(df)
'data.frame': 20305 obs. of 3 variables:
$ SYMBOL : Factor w/ 21050 levels "","10-Mar","10-Sep",..: 2 3 4 5 6 7 8 9 10 11 ...
$ PVALUE1: Factor w/ 209 levels "0","0.000109570493049298",..: 169 22 110 24 181 22 139 39 149 44 ...
$ PVALUE2: Factor w/ 216 levels "0","0.000109570493049298",..: 172 20 141 23 201 20 90 61 150 29 ...
I also tried mode():
sapply(df,mode)
SYMBOL PVALUE1 PVALUE2
"numeric" "numeric" "numeric"
When I try to assign a score to the two numeric columns (2, 3) based on the conditions below:
df$Score <- rowSums(ifelse(df[,-1]==0, 0,
ifelse(df[, -1]<= 0.05, 2, ifelse(df[,-1]>= 0.065,-2,1))))
I get Warning messages:
1: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
2: In Ops.factor(left, right) : ‘<=’ not meaningful for factors
3: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
4: In Ops.factor(left, right) : ‘>=’ not meaningful for factors
The output is as follows:
SYMBOL PVALUE1 PVALUE2 Score
1 10-Mar 0.813027629406118 0.78820189558684 NA
2 10-Sep 0.00167287722066533 0.00167287722066533 NA
3 11-Mar 0.21179810441316 0.464576340307205 NA
4 11-Sep 0.00221961024320294 0.00221961024320294 NA
5 12-Sep 0.934667427815304 0.986884425214009 NA
6 15-Sep 0.00167287722066533 0.00167287722066533 NA
If the factors are already numeric, why does the code above not work and give NA? What should I do?
Edit: dput() output
structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
I also tried this:
indx <- sapply(df, is.factor)
df[indx] <- lapply(df[indx], function(x) as.numeric(levels(x))[x])
indx returns
SYMBOL PVALUE1 PVALUE2
TRUE TRUE TRUE
Warning message:
In FUN(X[[3L]], ...) : NAs introduced by coercion
Answer 0 (score: 3)
Using your dput data, this works fine:
df = structure(list(SYMBOL = structure(1:6, .Label = c("10-Mar", "10-Sep",
"11-Mar", "11-Sep", "12-Sep", "15-Sep"), class = "factor"), PVALUE1 = structure(c(4L,
1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533", "0.00221961024320294",
"0.21179810441316", "0.813027629406118", "0.934667427815304"), class = "factor"),
PVALUE2 = structure(c(4L, 1L, 3L, 2L, 5L, 1L), .Label = c("0.00167287722066533",
"0.00221961024320294", "0.464576340307205", "0.78820189558684",
"0.986884425214009"), class = "factor")), .Names = c("SYMBOL",
"PVALUE1", "PVALUE2"), row.names = c(NA, 6L), class = "data.frame")
df$PVALUE1 = as.numeric(as.character(df$PVALUE1))
df$PVALUE2 = as.numeric(as.character(df$PVALUE2))
df
# SYMBOL PVALUE1 PVALUE2
# 1 10-Mar 0.813027629 0.788201896
# 2 10-Sep 0.001672877 0.001672877
# 3 11-Mar 0.211798104 0.464576340
# 4 11-Sep 0.002219610 0.002219610
# 5 12-Sep 0.934667428 0.986884425
# 6 15-Sep 0.001672877 0.001672877
sapply(df, class)
# SYMBOL PVALUE1 PVALUE2
# "factor" "numeric" "numeric"
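Once the p-value columns are numeric, the scoring expression from the question should run without the Ops.factor warnings. A minimal sketch, assuming a three-row subset of the posted data and the same thresholds (0.05 and 0.065) as in the question:

```r
# Small illustrative subset of the question's data, stored as factors.
df <- data.frame(
  SYMBOL  = factor(c("10-Mar", "10-Sep", "11-Mar")),
  PVALUE1 = factor(c("0.813027629406118", "0.00167287722066533", "0.21179810441316")),
  PVALUE2 = factor(c("0.78820189558684", "0.00167287722066533", "0.464576340307205"))
)

# Convert every column except SYMBOL from factor to numeric.
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))

# Score each p-value, then sum the scores per row.
p <- as.matrix(df[-1])
score <- ifelse(p == 0, 0,
         ifelse(p <= 0.05, 2,
         ifelse(p >= 0.065, -2, 1)))
df$Score <- rowSums(score)
df$Score
```

Here both p-values in the second row are below 0.05, so that row scores 2 + 2, while the other two rows score -2 for each column.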
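If you are having this kind of problem on the full data frame, there may be some irregular rows. However, I also looked at the CSV you provided in the comments, and it looks fine.
Also note that this is one of several equivalent solutions in the duplicate question you linked.
To convert all columns except the first, you can do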
df[, 2:ncol(df)] = lapply(df[, -1], function(x) as.numeric(as.character(x)))
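Note that you do not want to convert date columns or the SYMBOL column this way, since they are not numeric.
Similarly, to convert the columns named PVALUE1 through PVALUE47, you can build the column names and then convert them: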
col_to_convert = paste0("PVALUE", 1:47)
df[, col_to_convert] = lapply(df[, col_to_convert], function(x) as.numeric(as.character(x)))
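In general, best practice is to not have these columns as factors in the first place. However you are getting these data into R, there is probably a way to specify the column classes, for example colClasses in read.table, read.csv, etc.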
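As a rough illustration of the colClasses suggestion, the sketch below reads a tiny inline CSV instead of a file. The csv_text content is made up from two rows of the question's data, and the comma-separated layout is an assumption about how the real file looks:

```r
# Inline CSV standing in for the real file (assumed layout).
csv_text <- "SYMBOL,PVALUE1,PVALUE2
10-Mar,0.813027629406118,0.78820189558684
10-Sep,0.00167287722066533,0.00167287722066533"

# Declare the column types up front so nothing is read as a factor.
df <- read.csv(text = csv_text,
               colClasses = c("character", "numeric", "numeric"))

sapply(df, class)
```

With a real file you would pass the file path as the first argument instead of text =. If a column contains a non-numeric entry, reading it as "numeric" will fail with an error, which is a useful way to find irregular rows.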
Answer 1 (score: 3)
Using data.table:
library(data.table)
setDT(df)[, 2:3 := lapply(.SD, function(x)
as.numeric(levels(x))[x]), .SDcols=2:3]
Or a slightly faster version is to use set:
indx <- which(sapply(df, is.factor) & grepl('PVALUE', names(df)))
setDT(df)
for(j in indx){
set(df, i=NULL, j=j, value= as.numeric(levels(df[[j]]))[df[[j]]])
}
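I guess you got the warning because the 'indx' you created also includes the first column (since it is also a factor), but that column is non-numeric. When non-numeric elements are converted from factor to numeric, they are coerced to NA.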
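The coercion can be reproduced on a small factor of non-numeric labels. This is only a sketch: the symbol vector below is a two-element stand-in for the SYMBOL column, and the warning is suppressed so the result can be inspected:

```r
# Stand-in for the SYMBOL column: labels that are not numbers.
symbol <- factor(c("10-Mar", "10-Sep"))

# Levels such as "10-Mar" cannot be parsed as numbers, so as.numeric()
# would warn "NAs introduced by coercion"; the result is all NA.
converted <- suppressWarnings(as.numeric(levels(symbol))[symbol])
converted
```

Restricting the index to the PVALUE columns, as in the set() loop above, avoids touching SYMBOL at all.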
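According to ?factor:
To transform a factor 'f' to approximately its original numeric values, 'as.numeric(levels(f))[f]' is recommended and slightly more efficient than 'as.numeric(as.character(f))'.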
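The difference between the approaches, and the reason mode() reports "numeric" for a factor, can be seen on a toy factor. A minimal sketch with made-up values:

```r
# A factor whose labels look like numbers; levels are sorted as strings.
f <- factor(c("0.5", "0.01", "0.5"))

mode(f)                                  # "numeric": factors are stored as integer codes
codes      <- as.numeric(f)              # the internal codes, not the values
via_levels <- as.numeric(levels(f))[f]   # the idiom recommended in ?factor
via_char   <- as.numeric(as.character(f))

codes
via_levels
```

Calling as.numeric() directly on a factor returns the level positions, which is why the data only look numeric. Both of the other two forms recover the original values.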