Reading a CSV file whose values contain thousands separators

Time: 2015-10-06 13:05:07

Tags: r csv

The csv file I am trying to read has the following format:

Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"
"2015/08/03","79,435","21,654"
"2015/08/04","80,733","20,924"

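The delimiter is a comma, but every value is also wrapped in quotes because a comma is used as the thousands separator. I tried read.csv, read_csv from {readr}, and fread from {data.table}, and the best I can do is read all values in as strings and then convert them to numbers with a combination of as.numeric and gsub.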

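I also found this: Most elegant way to load csv with point as thousands separator in R. It is quite useful, but my data has many columns (not all of them numeric) and I would rather not specify the column types.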

Any ideas, or should I just start gsub-ing? Funnily enough, Excel reads the file just fine :)

3 answers:

Answer 0: (score: 2)

You should be able to read the data with read.csv. Here is an example:

#write data
write('Date,x,y\n"2015/08/01","71,131","20,390"\n"2015/08/02","81,599","23,273"\n"2015/08/03","79,435","21,654"\n"2015/08/04","80,733","20,924"',"test.csv")

#use "text" rather than "file" in read.csv
#perform regex substitution before using read.csv
#the outer gsub with '(?<=\\d),(\\d{3})(?!\\d)' performs the thousands separator substitution
#the inner gsub replaces all \" with '
read.csv(text=gsub('(?<=\\d),(\\d{3})(?!\\d)',
                   '\\1',
                   gsub("\\\"",
                        "'",
                        paste0(readLines("test.csv"),collapse="\n")),
                   perl=TRUE),
         header=TRUE,
         quote="'",
         stringsAsFactors=FALSE)

Result:

#        Date     x     y
#1 2015/08/01 71131 20390
#2 2015/08/02 81599 23273
#3 2015/08/03 79435 21654
#4 2015/08/04 80733 20924
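
To see what the outer substitution does in isolation, here is a minimal sketch that applies the same pattern to single strings instead of a file. The helper name strip_thousands and the second sample string are made up for illustration; only base R is used.

```r
# Pattern from the answer above: a comma that is preceded by a digit and
# followed by exactly three digits (and then no further digit) is treated
# as a thousands separator and dropped.
pattern <- '(?<=\\d),(\\d{3})(?!\\d)'

# Hypothetical helper wrapping the substitution.
strip_thousands <- function(s) gsub(pattern, '\\1', s, perl = TRUE)

# Field-separating commas follow a quote, not a digit, so they are kept.
line_clean <- strip_thousands('"2015/08/01","71,131","20,390"')
cat(line_clean, "\n")
# prints: "2015/08/01","71131","20390"

# Several groups inside one number are all removed.
big_clean <- strip_thousands('1,234,567')
cat(big_clean, "\n")
# prints: 1234567
```

Note that this relies on every field being quoted: in a file with unquoted numeric fields, a row such as 1,234 would also match the pattern and the two fields would be merged.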

Answer 1: (score: 2)

With the data.table package, you can do it as follows:

1: Create a vector of the column names you want to convert. In this case, Date has to be excluded:

cols <- setdiff(names(dt),"Date")

2: Apply the conversion function to the remaining columns:

library(data.table)
dt[, (cols) := lapply(.SD, function(x) as.numeric(gsub(",", "", x))), .SDcols = cols]

This results in:

> dt
         Date     x     y
1: 2015/08/01 71131 20390
2: 2015/08/02 81599 23273
3: 2015/08/03 79435 21654
4: 2015/08/04 80733 20924

Data used:

dt <- fread('Date,x,y
            "2015/08/01","71,131","20,390"
            "2015/08/02","81,599","23,273"
            "2015/08/03","79,435","21,654"
            "2015/08/04","80,733","20,924"')

Answer 2: (score: 0)

The best solution is to remove all of this formatting in Excel before exporting the sheet.

If that is not possible, just use lapply to convert each column:

df[c("x", "y")] <- lapply(df[c("x", "y")], function(x) as.numeric(gsub(",", "", x)))
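
Since the question mentions many columns, not all numeric, and a wish not to list column types by hand, the same lapply idea can be combined with a check that picks the columns automatically. The following is only a sketch in base R: the helper looks_grouped_numeric and its regular expression are assumptions about what counts as a number with comma separators (optional minus sign, optional decimal part), not something taken from the answers above.

```r
# Sample data from the question, read with every column as character.
csv_text <- 'Date,x,y
"2015/08/01","71,131","20,390"
"2015/08/02","81,599","23,273"
"2015/08/03","79,435","21,654"
"2015/08/04","80,733","20,924"'
df <- read.csv(text = csv_text, colClasses = "character")

# Hypothetical helper: TRUE if every non-missing value looks like a number,
# either plain digits or digits grouped in threes by commas.
looks_grouped_numeric <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 0 &&
    all(grepl("^-?(\\d{1,3}(,\\d{3})+|\\d+)(\\.\\d+)?$", x, perl = TRUE))
}

# Find the columns to convert, then strip the commas and convert them.
num_cols <- names(df)[vapply(df, looks_grouped_numeric, logical(1))]
df[num_cols] <- lapply(df[num_cols], function(x) as.numeric(gsub(",", "", x)))

str(df)  # Date stays character; x and y become numeric
```

A column is converted only if all of its values match, so columns such as Date, or columns containing empty strings, are left as character.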