Coercing variables in a data frame to the appropriate format

Time: 2017-08-29 13:47:48

Tags: r type-conversion coercion

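I am working with a data frame that consists of several different data types (numerics, characters, timestamps), but unfortunately all of them are received as characters. I therefore need to coerce them into their "appropriate" format dynamically and as efficiently as possible.

Consider the following example: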

df <- data.frame("val1" = c("1","2","3","4"), "val2" = c("A", "B", "C", "D"), stringsAsFactors = FALSE)

I would obviously like val1 to be numeric and val2 to remain character. My result should therefore look like this:

'data.frame':   4 obs. of  2 variables:
 $ val1: num  1 2 3 4
 $ val2: chr  "A" "B" "C" "D"

Right now I am accomplishing this by checking whether coercion would result in NA, and coercing only if it does not:

res <- as.data.frame(lapply(df, function(x){

  x <- sapply(x, function(y) {
    if (is.na(as.numeric(y))) {
      return(y)
    } else {
      y <- as.numeric(y)
      return(y)
    }    
  })

  return(x)

}), stringsAsFactors = FALSE) 

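However, this cannot be the right solution, for several reasons:

  1. I suspect there is a faster way to do this
  2. For some reason I get the warning In FUN(X[[i]], ...) : NAs introduced by coercion, although that is not the case (see the result)
  3. This does not seem appropriate for other data types, such as dates
  4. Is there a general heuristic for this, or another, more sustainable solution? Thanks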
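On point 2: the test is.na(as.numeric(y)) itself performs the coercion, so the warning is raised for every non-numeric value even though the original value is returned unchanged afterwards. A minimal sketch of a column-level check that avoids both the warning and the element-wise sapply is given below; the helper name to_numeric_if_possible is hypothetical and not part of the original question.

```r
df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: convert a whole column only if every
# non-missing value can be parsed as a number.
to_numeric_if_possible <- function(x) {
  # The coercion is attempted once per column; the expected warning is muted.
  converted <- suppressWarnings(as.numeric(x))
  # An NA that was not NA in the input means at least one value failed to parse.
  if (any(is.na(converted) & !is.na(x))) x else converted
}

res <- df
res[] <- lapply(df, to_numeric_if_possible)
str(res)
```

Working on whole columns keeps the coercion vectorised, which should also help with point 1.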

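2 answers:

Answer 0: (score: 2)

Recent file readers such as data.table::fread or the readr package do a pretty decent job of identifying columns and converting them to the appropriate type.

So my first reaction was to suggest writing the data to a file and reading it in again, e.g.,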

library(data.table)
fwrite(df, "dummy.csv")
df_new <- fread("dummy.csv")
str(df_new)
Classes ‘data.table’ and 'data.frame':    4 obs. of  2 variables:
 $ val1: int  1 2 3 4
 $ val2: chr  "A" "B" "C" "D"
 - attr(*, ".internal.selfref")=<externalptr>

Or without actually writing to disk:

df_new <- fread(paste(capture.output(fwrite(df, "")), collapse = "\n"))
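The same round trip can be sketched with base R only, in case data.table is not available. This is a swapped-in analogue of the fread approach rather than the approach itself: write.csv prints to the console when no file is given, capture.output collects those lines, and read.csv re-reads them through its text argument.

```r
df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)

# Print the CSV to the console and capture it as a character vector of lines
csv_lines <- capture.output(write.csv(df, row.names = FALSE))

# Re-read the captured text so that read.csv guesses the column types
df_new <- read.csv(text = paste(csv_lines, collapse = "\n"),
                   stringsAsFactors = FALSE)
str(df_new)
```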

However, d.b's suggestions are smarter, but they need some polishing to avoid coercion to factors:

df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
'data.frame': 4 obs. of  2 variables:
 $ val1: int  1 2 3 4
 $ val2: chr  "A" "B" "C" "D"

df[] <- lapply(df, readr::parse_guess)
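type.convert does not recognise dates, so character columns that hold dates stay character. A minimal base R sketch that layers a date check on top of it follows; the helper name convert_column, the extra column val3, and the assumption that dates arrive as "%Y-%m-%d" are all illustrative and not taken from the question.

```r
df <- data.frame(val1 = c("1", "2", "3", "4"),
                 val2 = c("A", "B", "C", "D"),
                 val3 = c("2017-08-29", "2017-08-30", "2017-08-31", "2017-09-01"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: let type.convert handle logical/integer/double,
# then try a single assumed date format on whatever is still character.
convert_column <- function(x, date_format = "%Y-%m-%d") {
  x <- type.convert(x, as.is = TRUE)
  if (is.character(x)) {
    present <- !is.na(x)
    parsed <- as.Date(x, format = date_format)
    # Adopt the dates only if every non-missing value parsed successfully
    if (any(present) && !anyNA(parsed[present])) {
      x <- parsed
    }
  }
  x
}

df[] <- lapply(df, convert_column)
str(df)
```

Note that as.Date ignores trailing characters after the part matched by the format, so this check is lenient. Timestamps could presumably be handled the same way with as.POSIXct and a matching format string.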

Answer 1: (score: 0)

You should check out the dataPreparation package. It provides the function findAndTransformNumerics, which does exactly what you want.

require(dataPreparation)
data("messy_adult")
sapply(messy_adult[, .(num1, num2, mail)], class)
   num1        num2        mail 
"character" "character"    "factor" 

messy_adult is an ugly data set used to illustrate the functions in this package. Here num1 and num2 are strings :/

messy_adult <- findAndTransformNumerics(messy_adult)
[1] "findAndTransformNumerics: It took me 0.18s to identify 3 numerics column(s), i will set them as numerics"
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum1"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I will set some columns as numeric"
[1] "setColAsNumeric: I am doing the columnnum2"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "setColAsNumeric: I am doing the columnnum3"
[1] "setColAsNumeric: 0 NA have been created due to transformation to numeric."
[1] "findAndTransformNumerics: It took me 0.09s to transform 3 column(s) to a numeric format."

Here the function performed the search and logged what it found.

And now:

sapply(messy_adult[, .(num1, num2, mail)], class)
     num1      num2      mail 
"numeric" "numeric"  "factor" 

Hope it helps!

Disclaimer: I am the author of this package.