In the code below I read a file from S3 into Spark and anonymize the data in the file:
library(sparklyr)    # provides spark_read_csv(); sc is an existing Spark connection
library(data.table)
library(digest)

ct_test <- spark_read_csv(
  sc,
  name = "test_data",
  memory = FALSE,
  path = "s3://XXXXXXX/sunny/Sample_data.csv",
  header = TRUE,
  delimiter = ",",
  stringsAsFactors = FALSE
)
cols_to_mask <- c("Email", "Phone")

# hash every non-empty value; leave blanks and NAs as empty strings
anonymize <- function(x, algo = "crc32") {
  sapply(x, function(y) if (y == "" | is.na(y)) "" else digest(y, algo = algo))
}

setDT(ct_test)
ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(ct_test)
But the code fails with the following error:
Error in setDT(ct_test) :
All elements in argument 'x' to 'setDT' must be of same length, but the profile of input lengths (length:frequency) is: [1:1, 2:1]
The first entry with fewer than 2 entries is 1
> ct_test[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
Error in `:=`((cols_to_mask), lapply(.SD, anonymize)) :
Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
Any help in resolving this issue would be appreciated.
Below is the output of str(ct_test):
$ ops:List of 2
..$ x : 'ident' chr "cx_data"
..$ vars: chr [1:5] "ID" "Name" "Email" "Phone" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:4] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
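If I read the str() output correctly, ct_test is a tbl_spark, i.e. a lazy reference to a remote Spark table rather than a local data frame, which seems to be why setDT() rejects it (and why the subsequent := call then complains that it is not a data.table). A sketch of what I think should work, assuming the sample is small enough to pull into local R memory, would be to collect() the table first:

# Sketch: materialise the remote Spark table as a local data frame first
# (collect() comes from dplyr and works on a tbl_spark via dbplyr),
# then apply the data.table-based anonymisation exactly as before.
library(dplyr)

ct_local <- collect(ct_test)    # tbl_spark -> local tibble
setDT(ct_local)                 # now setDT() receives a real data frame
ct_local[, (cols_to_mask) := lapply(.SD, anonymize), .SDcols = cols_to_mask]
print(ct_local)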
Input dataset:
ID,Name,Email,Phone,Survey
10,Ravi,test@gmail.com,874589,Survey 1
20,John,abc@gmail.com,878756,Survey 2
30,Smith,tt@yahoo.com,565656,Survey 3
40,Kevin,,,Survey 3
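For what it is worth, the anonymize() helper itself seems to behave as expected on a plain local character vector such as the Email column above (blank and missing entries stay empty), so the problem appears to be in how I apply it to the Spark table rather than in the hashing:

# quick local sanity check of anonymize(); non-empty values get a crc32 hash,
# while "" and NA come back as empty strings
anonymize(c("test@gmail.com", "", NA))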
Following a suggestion, I changed the code as follows:
cx_data <- spark_read_csv(
  sc,
  name = "cx_data",
  memory = FALSE,
  path = "s3://xxxx/sunny/Sample_data.csv",
  delimiter = ",",
  stringsAsFactors = FALSE
  # infer_schema = FALSE
)
test_data <- fread(cx_data)
But it now fails with the following error:
Error in fread(cx_data) :
input= must be a single character string containing a file name, a system command containing at least one space
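As far as I can tell, fread() expects a file name, a system command, or literal text, not a tbl_spark object, so passing cx_data straight to it cannot work. An alternative I am considering (not yet verified) is to keep the data in Spark and hash the sensitive columns there, relying on dbplyr passing Spark SQL's sha2() through from mutate():

# Sketch: mask Email and Phone inside Spark instead of locally.
# Assumes sc and cx_data from above; sha2() is Spark SQL's hash function
# (second argument is the bit length), and as.character() casts the numeric
# Phone column to a string so it can be hashed.
library(dplyr)

cx_masked <- cx_data %>%
  mutate(
    Email = ifelse(is.na(Email) | Email == "", "", sha2(Email, 256L)),
    Phone = ifelse(is.na(Phone), "", sha2(as.character(Phone), 256L))
  )

head(cx_masked)   # still a lazy tbl_spark; collect() it if a local copy is needed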