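I have software that generates experimental data with a limited width, so a string of data points gets wrapped into a series of rows limited to 4 columns wide in the final CSV, rather than one row per variable (A and B below), which is the form I need. (Sample CSV below.)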
A,1,3,3,2
,5,6,7,8
,9,10,11,12
,13,1,15,6
,17,1,2,20
B,1,2,3,7
,7,6,7,8
,9,10,11,12
,13,15,15,16
,17,18,3,2
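In the real data there are about 53,000 rows to process per day, so I was wondering whether there is a function that would let me unwrap or reshape a given subset of the data (per variable) into a single row. In the example above, the numbers following variable A would be combined into one row while preserving order (i.e. 1,3,3,2,5, ...), and likewise for B, and so on.
As requested, here is the dput output for a simplified version of the example above: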
structure(list(V1 = structure(c(2L, 1L, 1L, 1L, 1L, 3L), .Label = c("",
"A", "B"), class = "factor"), V2 = c(1L, 5L, 9L, 13L, 17L, 1L
), V3 = c(2L, 6L, 10L, 14L, 18L, 2L), V4 = c(3L, 7L, 11L, 15L,
19L, 3L), V5 = c(4L, 8L, 12L, 16L, 20L, 4L)), .Names = c("V1",
"V2", "V3", "V4", "V5"), row.names = c(NA, 6L), class = "data.frame")
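Answer 0 (score: 3)
You can preprocess the file with an external tool:
read.csv(pipe("sed -e :a -e '$!N;s/\\n,/,/;ta' -e 'P;D' file.txt"), head=FALSE)
Basically, file.txt is first processed by the Unix tool sed, which performs a search-and-replace and returns the new content to R. The regular expression, which I adapted from this page, performs the following task:
If a line begins with a comma, append it to the previous line
and drop the newline, keeping the comma as the field separator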
Edit (eddi; note: this does not seem to work on Mac OS): here is how sed parses the following command:
read.csv(pipe("sed ':a; N; s/\\n,/,/; t a; P; D' file.txt"), head=FALSE)
:a # label (named "a") we're going to come back to
N # read in the next line into pattern space, together with the newline character
s/\n,/,/ # if there is a newline followed by comma, delete the newline
t a # go back to "a" and repeat until the above match fails (t stands for test)
P # print everything in pattern space up to and including the first \n
D # delete everything in pattern space up to and including the first \n, then restart the cycle
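Where sed is unavailable or behaves differently (as the Mac OS note suggests), the same joining logic can be sketched in plain R. This is a minimal sketch, not the answer's method: `join_continuations` is a hypothetical helper name, and the inline `lines` vector stands in for `readLines("file.txt")`.

```r
# Sample lines standing in for readLines("file.txt") (assumed layout:
# continuation lines start with a comma, as in the question's CSV)
lines <- c("A,1,3,3,2", ",5,6,7,8", "B,1,2,3,7", ",7,6,7,8")

# join_continuations is a hypothetical helper, not part of any package
join_continuations <- function(lines) {
  out <- character(0)
  for (ln in lines) {
    if (startsWith(ln, ",") && length(out) > 0) {
      # Continuation line: append it to the previous record
      out[length(out)] <- paste0(out[length(out)], ln)
    } else {
      # Any other line starts a new record
      out <- c(out, ln)
    }
  }
  out
}

joined <- join_continuations(lines)
result <- read.csv(text = joined, header = FALSE)
```

Growing `out` inside a loop is simple but not the fastest option; for tens of thousands of lines a vectorised approach is likely to scale better.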
Answer 1 (score: 2)
grep, paste and read.table are very handy here.
# read in your data raw, one string per line
X <- readLines("file")
# Prefix every line that does NOT start with a comma with a line break,
# then re-read with read.table
read.table(text=paste(ifelse(grepl("^,", X), X, paste0("\n", X)), collapse=""), sep=",")
which yields:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 A 1 3 3 2 5 6 7 8 9 10 11 12 13 1 15 6 17 1 2 20
2 B 1 2 3 7 7 6 7 8 9 10 11 12 13 15 15 16 17 18 3 2
Answer 2 (score: 2)
Here is another base R solution. It uses gsub() and is short and readable (at least to me).
txt = readLines("file.txt")
# Join into one long string with newlines.
txt_long = paste(txt, collapse="\n")
# Remove newlines directly preceding a comma.
newtxt = gsub("\\n,", ",", txt_long)
read.table(text=newtxt, sep=",")
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
# 1 A 1 3 3 2 5 6 7 8 9 10 11 12 13 1 15 6 17 1 2 20
# 2 B 1 2 3 7 7 6 7 8 9 10 11 12 13 15 15 16 17 18 3 2
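If the variables do not all have the same number of values (an assumption; the question does not say either way), read.table will complain about lines with differing numbers of elements. A hedged sketch of one way around that is to count the fields first and pass `fill = TRUE` with explicit column names. The inline `txt` vector below is made-up sample data standing in for `readLines("file.txt")`.

```r
# Wrapped sample where B has fewer values than A (an assumed scenario)
txt <- c("A,1,2,3,4", ",5,6,7,8", "B,1,2,3,4")

# Drop each newline that directly precedes a comma, then split the
# result back into one string per variable
newtxt <- gsub("\n,", ",", paste(txt, collapse = "\n"), fixed = TRUE)
joined <- strsplit(newtxt, "\n", fixed = TRUE)[[1]]

# Widest record decides how many columns to allocate
n_cols <- max(count.fields(textConnection(joined), sep = ","))

# fill = TRUE pads the shorter records with NA
wide <- read.table(text = joined, sep = ",", fill = TRUE,
                   col.names = paste0("V", seq_len(n_cols)))
```

Setting `col.names` explicitly matters because read.table otherwise infers the column count from only the first few lines.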
Answer 3 (score: 1)
This is a bit ugly, but it's the first general strategy that came to mind:
library(zoo)
library(plyr)
dat$V1 <- na.locf(dat$V1)
ddply(dat,.(V1),function(x) c(t(as.matrix(x[,-1]))))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
1 1 3 3 2 5 6 7 8 9 10 11 12 13 1 15 6 17 1 2 20
2 1 2 3 7 7 6 7 8 9 10 11 12 13 15 15 16 17 18 3 2
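This assumes you read your data into an object called dat using na.strings = "". You can add the A, B variable information afterwards, or stuff it into the anonymous ddply function.
There may be a way to reshape it directly using dcast, but I couldn't think of one.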
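The same fill-forward-then-flatten strategy can also be sketched in base R, without zoo or plyr. This is only an illustration under stated assumptions: the inline sample stands in for the real file, and `grp`, `labels`, `rows` and `wide` are names chosen here rather than anything from the answer.

```r
# Inline sample standing in for read.csv("file.txt", ...); empty labels
# become NA because of na.strings = ""
dat <- read.csv(text = c("A,1,3,3,2", ",5,6,7,8", "B,1,2,3,7", ",7,6,7,8"),
                header = FALSE, na.strings = "", stringsAsFactors = FALSE)

# Each non-NA label starts a new group; cumsum carries the group forward
grp <- cumsum(!is.na(dat$V1))
labels <- dat$V1[!is.na(dat$V1)]

# Flatten each group's numeric block row by row, preserving order
rows <- lapply(split(dat[, -1], grp), function(x) c(t(as.matrix(x))))

# One row per variable; assumes every group has the same number of values
wide <- data.frame(V1 = labels, do.call(rbind, rows),
                   stringsAsFactors = FALSE)
```

Because the labels are kept in `labels`, the A and B information ends up in the first column of `wide` rather than having to be added afterwards.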
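Answer 4 (score: 1)
Don't you just love instrument manufacturers?
Here is one approach. I don't claim it is perfect, since I can't test it against all of your data, but you can.
Edit: updated function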
cleanData <- function(df) {
  # Indices of rows that start a variable (non-empty label in column 1);
  # needs at least two variables in df
  good <- which(df[, 1] != "")
  # Extra rows per variable, measured on the first block; every block is
  # assumed to span the same number of rows
  offset <- good[2] - good[1] - 1
  # Flatten the first block row by row to set the dimensions
  data <- as.numeric(t(as.matrix(df[good[1]:(good[1] + offset), -1])))
  newDat <- data.frame(data)
  names(newDat) <- as.character(df[good[1], 1])
  # Now do the remaining blocks, one output column per variable
  for (n in 2:length(good)) {
    use <- good[n]:(good[n] + offset)
    # Rows past the end of df come back as NA
    data <- as.numeric(t(as.matrix(df[use, -1])))
    newCol <- data.frame(data)
    names(newCol) <- as.character(df[good[n], 1])
    newDat <- cbind(newDat, newCol)
  }
  newDat
}
Copy and paste the function above into R, then run newTst <- cleanData(tst), where tst is your data frame from read.csv. If it works, inspect newTst or str(newTst).
On the test data, it gives:
'data.frame': 20 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10 ...
$ B: num 1 2 3 4 NA NA NA NA NA NA ...