In the code below I read a file from S3 into Spark and anonymize the data in the file:
library(sparklyr)    # provides spark_read_csv(); sc is an existing Spark connection
library(data.table)
library(digest)

ct_test <- spark_read_csv(
  sc,
  name = "test_data",
  memory = FALSE,
  path = "s3://XXXXXXX/sunny/Sample_data.csv",
  header = TRUE,
  delimiter = ",",
  stringsAsFactors = FALSE
)
cols_to_mask <- c("Email", "Phone")

# hash every non-empty value; leave blanks and NAs as empty strings
anonymize <- function(x, algo = "crc32") {
  sapply(x, function(y) if (y == "" | is.na(y)) "" else digest(y, algo = algo))
}

setDT(ct_test)
ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(ct_test)
But the code fails with the following error:
Error in setDT(ct_test) :
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [1:1, 2:1]
The first entry with fewer than 2 entries is 1
> ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
Error in `:=`((cols_to_mask), lapply(.SD, anonymize)) :
Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
Any help in resolving this issue would be appreciated.
Below is the output of str(ct_test):
$ ops:List of 2
..$ x : 'ident' chr "cx_data"
..$ vars: chr [1:5] "ID" "Name" "Email" "Phone" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:4] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
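If I read the str() output correctly, ct_test is a tbl_spark, i.e. a lazy reference to a remote Spark table rather than a local data frame, which seems to be why setDT() rejects it (and why the subsequent := call then complains that it is not a data.table). A sketch of what I think should work, assuming the sample is small enough to pull into local R memory, would be to collect() the table first:

# Sketch: materialise the remote Spark table as a local data frame first
# (collect() comes from dplyr and works on a tbl_spark via dbplyr),
# then apply the data.table-based anonymisation exactly as before.
library(dplyr)

ct_local <- collect(ct_test)    # tbl_spark -> local tibble
setDT(ct_local)                 # now setDT() receives a real data frame
ct_local[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(ct_local)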
Input dataset:
ID,Name,Email,Phone,Survey
10,Ravi,test@gmail.com,874589,Survey 1
20,John,abc@gmail.com,878756,Survey 2
30,Smith,tt@yahoo.com,565656,Survey 3
40,Kevin,,,Survey 3
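For what it is worth, the anonymize() helper itself seems to behave as expected on a plain local character vector such as the Email column above (blank and missing entries stay empty), so the problem appears to be in how I apply it to the Spark table rather than in the hashing:

# quick local sanity check of anonymize(); non-empty values get a crc32 hash,
# while "" and NA come back as empty strings
anonymize(c("test@gmail.com", "", NA))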
Following a suggestion, I changed the code as follows:
cx_data <- spark_read_csv(
  sc,
  name = "cx_data",
  memory = FALSE,
  path = "s3://xxxx/sunny/Sample_data.csv",
  delimiter = ",",
  stringsAsFactors = FALSE
  # infer_schema = FALSE
)
test_data <- fread(cx_data)
But it now fails with the following error:
Error in fread(cx_data) :
input= must be a single character string containing a file name, a system command containing at least one space
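As far as I can tell, fread() expects a file name, a system command, or literal text, not a tbl_spark object, so passing cx_data straight to it cannot work. An alternative I am considering (not yet verified) is to keep the data in Spark and hash the sensitive columns there, relying on dbplyr passing Spark SQL's sha2() through from mutate():

# Sketch: mask Email and Phone inside Spark instead of locally.
# Assumes sc and cx_data from above; sha2() is Spark SQL's hash function
# (second argument is the bit length), and as.character() casts the numeric
# Phone column to a string so it can be hashed.
library(dplyr)

cx_masked <- cx_data %>%
  mutate(
    Email = ifelse(is.na(Email) | Email == "", "", sha2(Email, 256L)),
    Phone = ifelse(is.na(Phone), "", sha2(as.character(Phone), 256L))
  )

head(cx_masked)   # still a lazy tbl_spark; collect() it if a local copy is needed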