I have a CSV that is generated in the following format:
"Serial","Long","Lat","Date","VariableX"
300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10
300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40
"Serial","Long","Lat","Date","VariableY"
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2
300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9
300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1
What I want to do is rearrange it into the following format:
"Serial","Long","Lat","Date","VariableX","VariableY"
300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10,
300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20,
300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30,3.5
300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40,4.2
300,51.5068,-0.0725,"9/Feb/2014 13:03:13",,3.9
300,51.5068,-0.0725,"9/Feb/2014 13:03:14",,4.1
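My plan is to search the CSV for every occurrence of "Serial" to get the header row numbers, split the rows at those points into separate data frames, and then merge them back together by matching on the Date column. I haven't got that far yet, but I assume the merge would leave any unmatched columns empty.
In R I tried readLines and was able to identify where each table starts, but I don't think it was picking up the data columns correctly. So I switched back to read.csv and tried merge, but I keep getting the following error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column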
Answer 0 (score: 0)
This should help you get started, but it is untested because your sample data seems to be missing some information:
## Read the file into R
x <- readLines("~/test.txt") ## Replace with your actual path/filename
## Convert to data.frames and merge
Reduce(function(x, y)
  merge(x, y, by = c("Serial", "Long", "Lat", "Date"), all = TRUE),
  lapply(split(x, cumsum(grepl("Serial", x))),
         function(y) read.csv(text = y)))
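For reference, the same split-and-merge idea can be tried on the sample data from the question without a file on disk. This is only a sketch: the names sample_lines, table_id, tables, keys and merged are made up for illustration, and it assumes the word Serial appears only in header lines.

```r
# Sample data from the question, one element per line (hypothetical name).
sample_lines <- c(
  '"Serial","Long","Lat","Date","VariableX"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40',
  '"Serial","Long","Lat","Date","VariableY"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1'
)

# A running count of header lines assigns every line to its table.
table_id <- cumsum(grepl("Serial", sample_lines, fixed = TRUE))

# Parse each chunk (a header plus its rows) as its own data frame.
tables <- lapply(split(sample_lines, table_id), function(chunk) {
  read.csv(text = chunk, stringsAsFactors = FALSE)
})

# Full outer join on the shared key columns; unmatched cells become NA.
keys <- c("Serial", "Long", "Lat", "Date")
merged <- Reduce(function(a, b) merge(a, b, by = keys, all = TRUE), tables)
print(merged)
```

To get the blank cells shown in the desired output instead of NA, the result can be written with write.csv(merged, "out.csv", na = "", row.names = FALSE), where out.csv is a placeholder filename.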
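Answer 1 (score: 0)
Try this. Raw is a character vector with one component per line. is.hdr is a logical vector indicating which lines are header lines. DF is the data frame formed from the data without the headers. varnames is a character vector of the names VariableX, VariableY, ... Time is a character vector with one component per row of DF, giving the variable name associated with that row. Finally, we use dcast to form the result.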
Lines <- '"Serial","Long","Lat","Date","VariableX"
300,51.5068,"9/Feb/2014 13:03:09",10
300,51.5068,"9/Feb/2014 13:03:10",20
300,51.5068,"9/Feb/2014 13:03:11",30
300,51.5068,"9/Feb/2014 13:03:12",40
"Serial","Long","Lat","Date","VariableY"
300,51.5068,"9/Feb/2014 13:03:11",3.5
300,51.5068,"9/Feb/2014 13:03:12",4.2
300,51.5068,"9/Feb/2014 13:03:13",3.9
300,51.5068,"9/Feb/2014 13:03:14",4.1'
library(reshape2)
# Raw <- readLines("myfile.dat")
Raw <- readLines(textConnection(Lines))
is.hdr <- grepl("Serial", Raw)
DF <- read.table(text = Raw[!is.hdr], sep = ",")
names(DF) <- c("Long", "Lat", "Date", "Variables")
varnames <- gsub('.*,"|"$', "", Raw[is.hdr])
Time <- varnames[cumsum(is.hdr)[!is.hdr]]
dcast(data = DF, Lat + Long + Date ~ Time)
which gives:
Using Variables as value column: use value.var to override.
      Lat Long                Date VariableX VariableY
1 51.5068  300 9/Feb/2014 13:03:09        10        NA
2 51.5068  300 9/Feb/2014 13:03:10        20        NA
3 51.5068  300 9/Feb/2014 13:03:11        30       3.5
4 51.5068  300 9/Feb/2014 13:03:12        40       4.2
5 51.5068  300 9/Feb/2014 13:03:13        NA       3.9
6 51.5068  300 9/Feb/2014 13:03:14        NA       4.1
Revised in response to the poster's comments.
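The sample data in the question has five fields per row (Serial, Long, Lat, Date and the value), whereas the snippet above names only four. Below is a minimal sketch of the same long-then-wide idea for the five-field layout. It swaps in base R's reshape for dcast so that no extra package is needed. The names raw, is_hdr, long, var_names, id_cols and wide are made up for illustration, and the sketch assumes Serial appears only in header lines.

```r
# Sample data from the question, one element per line (hypothetical name).
raw <- c(
  '"Serial","Long","Lat","Date","VariableX"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:09",10',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:10",20',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",30',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",40',
  '"Serial","Long","Lat","Date","VariableY"',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:11",3.5',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:12",4.2',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:13",3.9',
  '300,51.5068,-0.0725,"9/Feb/2014 13:03:14",4.1'
)

# Flag header lines, then read every data line into one long table.
is_hdr <- grepl("Serial", raw, fixed = TRUE)
id_cols <- c("Serial", "Long", "Lat", "Date")
long <- read.table(text = raw[!is_hdr], sep = ",",
                   col.names = c(id_cols, "value"),
                   stringsAsFactors = FALSE)

# The last quoted field of each header line is the variable name.
var_names <- gsub('.*,"|"$', "", raw[is_hdr])

# Tag each data row with the name from the most recent header above it.
long$variable <- var_names[cumsum(is_hdr)[!is_hdr]]

# Long to wide: one row per key combination, one column per variable.
wide <- reshape(long, idvar = id_cols, timevar = "variable",
                direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
rownames(wide) <- NULL
print(wide)
```

Rows that have no value for a given variable come out as NA, matching the blank cells in the desired output.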