Suppressing commas inside double quotes when reading a CSV file into an .Xdf file with the rxImport function

Time: 2015-06-27 09:06:01

Tags: r csv bigdata revolution-r

I am trying to convert a large .CSV file to an .Xdf file using the rxImport() function, with the following code:

rxImport(inData = "/poc/revor/data/ext_roll36_chrg_vol.csv",
         outFile = "/poc/revor/data/ext_roll36_chrg_vol.xdf", 
         overwrite = TRUE, rowsPerRead = 100000,
         colClasses = c(SE_NO = "character", 
                        HIER_ROLLUP_CD = "character", 
                        CUR_MO_CT ="numeric", 
                        CUR_MO_AM = "numeric", 
                        AD_LINE_1_TX = "character",
                        AD_LINE_2_TX = "character",
                        SUBMIT_DT = "character", 
                        UPDT_TS = "character"),
         transforms = list(SUBMIT_DT = as.Date(SUBMIT_DT, format="%d%b%Y")))

However, the file contains many records like this one:

0200001097,SS,625,236899.000,"KRAV MAGA WORLDWIDE, INC.","KRAV MAGA WORLDWIDE, INC.",01MAY2014,07JUN2014:01:08:57.000000

As you can see, the columns AD_LINE_1_TX and AD_LINE_2_TX contain commas inside double quotes.
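To make the problem concrete, here is a minimal base R sketch (plain R, not RevoScaleR) that counts the fields in the sample record above with and without quote handling. The variable names are only illustrative.

```r
# The sample record from the question, stored as one string.
record <- '0200001097,SS,625,236899.000,"KRAV MAGA WORLDWIDE, INC.","KRAV MAGA WORLDWIDE, INC.",01MAY2014,07JUN2014:01:08:57.000000'

# Naive split on every comma: each quoted name is cut in two.
naive_fields <- strsplit(record, ",", fixed = TRUE)[[1]]
length(naive_fields)   # 10

# Quote-aware split: scan() keeps text inside double quotes as one field.
quoted_fields <- scan(text = record, what = character(), sep = ",",
                      quote = "\"", quiet = TRUE)
length(quoted_fields)  # 8
quoted_fields[5]       # "KRAV MAGA WORLDWIDE, INC."
```

The file has 8 columns, so any reader that ignores the quotes sees two extra fields on rows like this one.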

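I have tried the type = "text" argument, but then the SE_NO column is read as numeric, even though its type is shown as character. This is a problem for all the numeric-looking fields that I want read as character.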

If I use the transforms argument to convert the column to character:

transforms = list(SE_NO = as.character(as.numeric(SE_NO)))

then the SE_NO value changes from 0200001097 to 0200001000 in the conversion between numeric and character (exponential representation 2.000011e+08).
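The underlying issue can be reproduced in base R (outside RevoScaleR, whose exact output may differ): once an identifier has passed through a numeric type, its original text cannot be recovered by converting back. A minimal sketch, with illustrative variable names:

```r
se_no <- "0200001097"

# Round trip through numeric: the leading zero cannot survive.
round_trip <- as.character(as.numeric(se_no))
round_trip                    # "200001097"
identical(round_trip, se_no)  # FALSE

# Reading the field as character from the start keeps it intact.
kept <- read.csv(text = "0200001097,SS", header = FALSE,
                 colClasses = "character")
kept$V1                       # "0200001097"
```

This is why the column type has to be fixed at read time rather than repaired afterwards in a transform.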

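So is there any other way to suppress the commas inside double quotes without affecting the other columns?

Please let me know if you need any further information.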

1 Answer:

Answer 0: (score: 0)

This should do what you need...

input_file <- "/poc/revor/data/ext_roll36_chrg_vol.csv"
output_file <- "/poc/revor/data/ext_roll36_chrg_vol.xdf"

my_colInfo <- list(list(index = 1, type = "character", newName = "SE_NO"),
                   list(index = 2, type = "character", newName = "HIER_ROLLUP_CD"),
                   list(index = 3, type = "numeric", newName = "CUR_MO_CT"),
                   list(index = 4, type = "numeric", newName = "CUR_MO_AM"),
                   list(index = 5, type = "character", newName = "AD_LINE_1_TX"),
                   list(index = 6, type = "character", newName = "AD_LINE_2_TX"),
                   list(index = 7, type = "character", newName = "SUBMIT_DT"),
                   list(index = 8, type = "character", newName = "UPDT_TS"))

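# Describe the CSV source. quotedDelimiters = TRUE makes the fast reader
# treat commas inside double-quoted strings as data, not as delimiters.
input_source <- RxTextData(file = input_file, 
                           colInfo = my_colInfo,
                           delimiter = ",",
                           quotedDelimiters = TRUE,
                           useFastRead = TRUE)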

rxImport(inData = input_source,
         outFile = output_file, 
         overwrite = TRUE, rowsPerRead = 100000,
         transforms = list(SUBMIT_DT = as.Date(SUBMIT_DT, format="%d%b%Y")))
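
Before running the full import, the same column layout can be cross-checked on a single record in base R. This is only a sketch that uses read.csv as a stand-in for RxTextData, so it does not prove how RevoScaleR behaves. The column names and types are copied from my_colInfo above; everything else (variable names, the temporary locale switch) is an assumption made for illustration.

```r
record <- '0200001097,SS,625,236899.000,"KRAV MAGA WORLDWIDE, INC.","KRAV MAGA WORLDWIDE, INC.",01MAY2014,07JUN2014:01:08:57.000000'

col_names <- c("SE_NO", "HIER_ROLLUP_CD", "CUR_MO_CT", "CUR_MO_AM",
               "AD_LINE_1_TX", "AD_LINE_2_TX", "SUBMIT_DT", "UPDT_TS")
col_types <- c("character", "character", "numeric", "numeric",
               "character", "character", "character", "character")

# Quote-aware read with explicit column types, so SE_NO stays character.
sample_df <- read.csv(text = record, header = FALSE, quote = "\"",
                      col.names = col_names, colClasses = col_types)

# %b matches month abbreviations of the current locale; switch to the
# C locale temporarily so that "MAY" is recognised as an English month.
old_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
sample_df$SUBMIT_DT <- as.Date(sample_df$SUBMIT_DT, format = "%d%b%Y")
Sys.setlocale("LC_TIME", old_locale)

str(sample_df)
```

If the 8 columns come out with the expected types here but not in the .xdf file, the difference lies in the RxTextData settings rather than in the data.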