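I am working with a data frame that contains several different data types (numbers, characters, timestamps), but unfortunately all of them arrive as character. I therefore need to coerce them dynamically, and as efficiently as possible, into their "appropriate" types.
Consider the following example: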
df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
I obviously want val1 to be numeric and val2 to remain character. So my result should look like this:
'data.frame': 4 obs. of 2 variables:
$ val1: num 1 2 3 4
$ val2: chr "A" "B" "C" "D"
Right now I do this by checking whether the coercion would result in NA, and coercing only if it does not:
res <- as.data.frame(lapply(df, function(x){
  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }
  })
  return(x)
}), stringsAsFactors = FALSE)
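However, this does not strike me as the right solution, because of several issues. For one, I get the warning
In FUN(X[[i]], ...) : NAs introduced by coercion
although that is not the case (see the result). Is there a general heuristic approach to this, or another, more sustainable solution? Thanks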
Answer 0 (score: 2)
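Recent file readers such as data.table::fread or the readr package do a fairly decent job of identifying column types and converting columns to the appropriate type.
So my first reaction was to suggest writing the data to a file and reading it back in, e.g.,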
library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame': 4 obs. of 2 variables:
 $ val1: int 1 2 3 4
 $ val2: chr "A" "B" "C" "D"
 - attr(*, ".internal.selfref")=<externalptr>
Or, without actually writing to disk:
df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
However, d.b's suggestions are smarter, although they need a little polishing to avoid coercion to factor:
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of 2 variables:
 $ val1: int 1 2 3 4
 $ val2: chr "A" "B" "C" "D"
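To illustrate why as.is = TRUE matters here, the following minimal sketch runs type.convert on the question's example data with both settings of as.is. It uses base R only; the names with_factors and as_is are introduced purely for illustration.

```r
# Example data from the question: every column arrives as character.
df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# as.is = FALSE: strings that are not numbers or logicals become factors.
with_factors <- lapply(df, type.convert, as.is = FALSE)

# as.is = TRUE: such strings stay character.
as_is <- lapply(df, type.convert, as.is = TRUE)

sapply(with_factors, class)  # val1 "integer", val2 "factor"
sapply(as_is, class)         # val1 "integer", val2 "character"
```

Note that val1 comes back as integer rather than double, because every value is a whole number.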
or
df[] <- lapply(df, readr::parse_guess)
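The question also mentions timestamps, which type.convert leaves as character. One possible way to cover them in base R is sketched below. The helper convert_column, the column val3, and the assumed timestamp format "%Y-%m-%d %H:%M:%S" are all hypothetical and would need adapting to the real data.

```r
# Hypothetical helper, not part of any package.
# ts_format is an assumption about how timestamps are written.
convert_column <- function(x, ts_format = "%Y-%m-%d %H:%M:%S") {
  # Let type.convert() handle logical, integer and double first.
  out <- type.convert(x, as.is = TRUE)
  if (!is.character(out)) return(out)

  # Try the assumed timestamp format; values that do not match become NA.
  ts <- as.POSIXct(out, format = ts_format, tz = "UTC")

  # Accept the conversion only if every non-missing value parsed.
  present <- !is.na(out)
  if (any(present) && all(!is.na(ts[present]))) ts else out
}

# Toy data extending the question's example with a timestamp column.
df2 <- data.frame(
  val1 = c("1", "2"),
  val2 = c("A", "B"),
  val3 = c("2019-01-01 10:00:00", "2019-01-02 11:30:00"),
  stringsAsFactors = FALSE
)

df2[] <- lapply(df2, convert_column)
sapply(df2, function(col) class(col)[1])
```

The all-or-nothing check keeps a column as character as soon as a single value fails to parse, which avoids silently introducing NAs.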
Answer 1 (score: 0)
You should check out the dataPreparation package. There you will find the function findAndTransformNumerics, which does exactly what you want.
require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"character" "character" "factor"
messy_adult is an ugly data set used to illustrate the functions in this package. Here num1 and num2 are strings :/
messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."
Here the function performed the search and logged what it found.
And now:
sapply(messy_adult[, .(num1, num2, mail)], class)
num1 num2 mail
"numeric" "numeric" "factor"
Hope it helps!
Disclaimer: I am the author of this package.
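For readers who prefer not to add a dependency, the idea reflected in the log above (convert a column only when doing so creates no new NA) can be approximated in base R. This is a rough sketch under that assumption, not the package's actual implementation; is_numeric_like and the toy data frame are hypothetical.

```r
# Hypothetical helper, base R only; it does not reproduce dataPreparation's internals.
is_numeric_like <- function(x) {
  if (!is.character(x)) return(FALSE)
  # Convert the whole column at once and silence the coercion warning.
  converted <- suppressWarnings(as.numeric(x))
  # Numeric-like if the conversion creates no NA that was not already there.
  any(!is.na(x)) && sum(is.na(converted)) == sum(is.na(x))
}

# Toy data: one numeric-looking column and one genuine text column.
toy <- data.frame(
  num1 = c("1.5", "2", NA),
  mail = c("a@b.c", "d@e.f", "g@h.i"),
  stringsAsFactors = FALSE
)

numeric_cols <- names(toy)[vapply(toy, is_numeric_like, logical(1))]
toy[numeric_cols] <- lapply(toy[numeric_cols], as.numeric)
sapply(toy, class)
```

Checking whole columns rather than individual values also avoids the element-by-element sapply from the question.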