Reorganizing a CSV that contains multiple tables in R

Time: 2014-02-24 17:13:57

Tags: r csv dataframe

I have a CSV that is generated in the following format:

"Serial","Long","Lat","Date","VariableX"
300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10
300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40
"Serial","Long","Lat","Date","VariableY"
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2
300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9
300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1

What I want to do is rearrange it into the following format:

"Serial","Long","Lat","Date","VariableX","VariableY"
300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10,
300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20,
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30,3.5
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40,4.2
300,51.5068,-0.0725,"9/Feb/2014 13:03:13",,3.9
300,51.5068,-0.0725,"9/Feb/2014 13:03:14",,4.1

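My approach has been to search the CSV for every occurrence of "Serial" to get the header row numbers, split the rows at those positions into separate data frames, and then merge them back together by matching on the Date column. I haven't gotten that far yet, but I expect the merge would leave any non-matching columns blank.

In R I tried using readLines and was able to identify where each table starts, but I don't think it picked up the data columns correctly, so I switched back to read.csv and tried merging. However, I keep getting the following error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column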
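One common cause of that error is that a name passed to by is not a column in one of the two data frames. The sketch below reproduces the message and shows a quick check; the data frames a and b and the column name When are hypothetical and not taken from the question.

```r
# Hypothetical data frames; b has no "Date" column (it is called "When")
a <- data.frame(Date = c("d1", "d2"), VariableX = c(10, 20))
b <- data.frame(When = c("d2", "d3"), VariableY = c(3.5, 4.2))

# merge() fails because "Date" is not a column of b; capture the message
msg <- tryCatch(
  {
    merge(a, b, by = "Date", all = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Columns present in both data frames; only these are valid for `by`
shared <- intersect(names(a), names(b))
print(shared)

# Once the key column has the same name in both, the merge succeeds
names(b)[names(b) == "When"] <- "Date"
merged <- merge(a, b, by = "Date", all = TRUE)
print(merged)
```

Printing names() of each data frame before merging is usually enough to spot this kind of mismatch.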

2 answers:

Answer 0: (score: 0)

This should get you started, but it is untested because your sample data seems to be missing some information:

## Read the file into R
x <- readLines("~/test.txt") ## Replace with your actual path/filename

## Convert to data.frames and merge
Reduce(function(x, y) 
  merge(x, y, by = c("Serial", "Long", "Lat", "Date"), all = TRUE), 
       lapply(split(x, cumsum(grepl("Serial", x))), 
              function(y) read.csv(text = y)))
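To try this approach without a file on disk, the sample data from the question can be held in a string. This is only a sketch: the names Lines, raw, group, tables and result are illustrative, and it assumes "Serial" appears only in header lines.

```r
# Sample data copied from the question, held in a string instead of a file
Lines <- '"Serial","Long","Lat","Date","VariableX"
300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10
300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40
"Serial","Long","Lat","Date","VariableY"
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2
300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9
300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1'

con <- textConnection(Lines)
raw <- readLines(con)
close(con)

# Each header line starts a new group, so the running count of headers
# labels every line with the table it belongs to
group <- cumsum(grepl("Serial", raw))

# Parse each group (header plus its data rows) into its own data frame
tables <- lapply(split(raw, group), function(chunk) {
  read.csv(text = chunk, stringsAsFactors = FALSE)
})

# Full outer join of all tables on the shared key columns
result <- Reduce(function(a, b) {
  merge(a, b, by = c("Serial", "Long", "Lat", "Date"), all = TRUE)
}, tables)
print(result)
```

Rows without a match in the other table get NA. Writing the result with write.csv(result, "out.csv", row.names = FALSE, na = "") (the file name is a placeholder) should turn those into empty fields, as in the desired output.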

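Answer 1: (score: 0)

Try this. Raw is a character vector with one component per line. is.hdr is a logical vector indicating which lines are header lines. DF is a data frame built from the data lines, without the headers. varnames is a character vector of the VariableX, VariableY, ... names. Time is a character vector with one component per row of DF, giving the variable name associated with that row. Finally, we use dcast to form the result.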

Lines <- '"Serial","Long","Lat","Date","VariableX"
300,51.5068,"9/Feb/2014 13:03:09",10
300,51.5068,"9/Feb/2014 13:03:10",20
300,51.5068,"9/Feb/2014 13:03:11",30
300,51.5068,"9/Feb/2014 13:03:12",40
"Serial","Long","Lat","Date","VariableY"
300,51.5068,"9/Feb/2014 13:03:11",3.5
300,51.5068,"9/Feb/2014 13:03:12",4.2
300,51.5068,"9/Feb/2014 13:03:13",3.9
300,51.5068,"9/Feb/2014 13:03:14",4.1'

library(reshape2)

# Raw <- readLines("myfile.dat")
Raw <- readLines(textConnection(Lines))
is.hdr <- grepl("Serial", Raw)

DF <- read.table(text = Raw[!is.hdr], sep = ",") 
names(DF) <- c("Long", "Lat", "Date", "Variables")

varnames <- gsub('.*,"|"$', "", Raw[is.hdr])
Time <- varnames[cumsum(is.hdr)[!is.hdr]]

dcast(data = DF, Lat + Long + Date ~ Time)

giving:

Using Variables as value column: use value.var to override.
      Lat Long                Date VariableX VariableY
1 51.5068  300 9/Feb/2014 13:03:09        10        NA
2 51.5068  300 9/Feb/2014 13:03:10        20        NA
3 51.5068  300 9/Feb/2014 13:03:11        30       3.5
4 51.5068  300 9/Feb/2014 13:03:12        40       4.2
5 51.5068  300 9/Feb/2014 13:03:13        NA       3.9
6 51.5068  300 9/Feb/2014 13:03:14        NA       4.1

Edited according to the poster's comments.
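Note that the sample rows in this answer have four fields each, whereas the rows in the question have five (Serial, Long, Lat, Date and the value). The sketch below applies the same idea, labelling each data row with the variable name from its header, to the five-field data. It swaps dcast for base R's reshape so that no extra package is needed; the column names Value and Variable and the object name wide are illustrative choices.

```r
# Sample data from the question (five fields per row), one line per element
Raw <- c(
  '"Serial","Long","Lat","Date","VariableX"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40',
  '"Serial","Long","Lat","Date","VariableY"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1'
)
is.hdr <- grepl("Serial", Raw)

# Data rows only, with a generic name for the last (value) column
DF <- read.csv(text = Raw[!is.hdr], header = FALSE, stringsAsFactors = FALSE,
               col.names = c("Serial", "Long", "Lat", "Date", "Value"))

# Variable name for each data row, taken from the last field of its header
varnames <- gsub('.*,"|"$', "", Raw[is.hdr])
DF$Variable <- varnames[cumsum(is.hdr)[!is.hdr]]

# Long to wide: one value column per distinct variable name
wide <- reshape(DF,
                idvar = c("Serial", "Long", "Lat", "Date"),
                timevar = "Variable",
                v.names = "Value",
                direction = "wide")

# reshape() names the new columns Value.VariableX, Value.VariableY
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```

With reshape2 loaded, the equivalent call on this DF would be dcast(DF, Serial + Long + Lat + Date ~ Variable, value.var = "Value"), which also avoids the "Using Variables as value column" message shown above.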