Efficiently reading data with multiple separator lines in R

Asked: 2013-07-12 01:51:03

Tags: r

I have a sample data set like this:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506

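The first column of each header line gives the number of data rows in that block (for example, 8 for 02-Model (Minimum)). After those 8 rows there is another header line, 7 03 (Maximum), which means that 03 (Maximum) has 7 rows of data.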

The function I wrote is as follows:

readts <- function(x)
{
  path <- x
  # Read the first line of the file
  hello1 <- read.table(path, header = F, nrows = 1,sep="\t")
  tmp1 <- hello1$V1
  # Read the data below first line
  hello2 <- read.table(path, header = F, nrows = (tmp1), skip = 1, 
                       col.names = c("Time", "value"))
  hello2$name <- c(as.character(hello1$V2))
  # Read data for the second chunk
  hello3 <- read.table(path, header = F, skip = (tmp1 + 1), 
                       nrows = 1,sep="\t")
  tmp2 <- hello3$V1
  hello4 <- read.table(path, header = F, skip = (tmp1 + 2), 
                       col.names = c("Time", "value"),nrows=tmp2)
  hello4$name <- c(as.character(hello3$V2))
  # Combine data to create a dataframe
  df <- rbind(hello2, hello4)
  return(df)
}

The output I get is as follows:

> readts("jdtrial.txt")
       Time    value               name
1  250.0417 17.49966 02-Model (Minimum)
2  250.0833 17.50000 02-Model (Minimum)
3  250.1250 17.50089 02-Model (Minimum)
4  250.1667 17.50117 02-Model (Minimum)
5  250.2083 17.50138 02-Model (Minimum)
6  250.2500 17.50214 02-Model (Minimum)
7  250.2917 17.50256 02-Model (Minimum)
8  250.3333 17.50168 02-Model (Minimum)
9  250.0417 17.50206       03 (Maximum)
10 250.0833 17.50115       03 (Maximum)
11 250.1250 17.50113       03 (Maximum)
12 250.1667 17.50124       03 (Maximum)
13 250.2083 17.50160       03 (Maximum)
14 250.2500 17.50247       03 (Maximum)
15 250.2917 17.50432       03 (Maximum)

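jdtrial.txt is the data shown above. However, my function does not work when I have a large file with many separator lines: I would need to add more code for every extra block, which makes the function even messier. Is there an easier way to read a data file like this? Thanks.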

The expected output has the same form as the output I get above. Here is some data you can try:

 8  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
250.125  17.5008907318115
250.16667175293  17.5011672973633
250.20832824707  17.5013771057129
250.25   17.502140045166
250.29167175293  17.5025615692139
250.33332824707  17.5016822814941
 7  03 (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
 8  04-Model (Maximum)
250.04167175293  17.5020561218262
250.08332824707  17.501148223877
250.125  17.501127243042
250.16667175293  17.5012378692627
250.20832824707  17.5016021728516
250.25   17.5024681091309
250.29167175293  17.5043239593506
250.33332824707  17.5055828094482

4 answers:

Answer 0: (score: 3)

It is not clear what "multiple separators" refers to, but here is a solution for the data you actually show.

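Read the data with fill=TRUE so that blank fields are filled in. Use is.hdr to track which rows are headers. Convert V2 to numeric (first replacing V2 with NA in the header rows so that the conversion does not generate warnings). Then, in the next two columns, replace the non-header rows with NAs and use na.locf from the zoo package to fill those NAs with the preceding header values. Finally, keep only the non-header rows.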

library(zoo)
DF <- read.table("jdtrial.txt", fill = TRUE, as.is = TRUE)

is.hdr <- DF$V3 != ""
transform(DF, 
    V2 = as.numeric(replace(V2, is.hdr, NA)),
    V3 = na.locf(ifelse(is.hdr, V2, NA)),
    name = na.locf(ifelse(is.hdr, V3, NA)))[!is.hdr, ]

The result of the last statement is:

         V1       V2       V3      name
2  250.0417 17.49966 02-Model (Minimum)
3  250.0833 17.50000 02-Model (Minimum)
4  250.1250 17.50089 02-Model (Minimum)
5  250.1667 17.50117 02-Model (Minimum)
6  250.2083 17.50138 02-Model (Minimum)
7  250.2500 17.50214 02-Model (Minimum)
8  250.2917 17.50256 02-Model (Minimum)
9  250.3333 17.50168 02-Model (Minimum)
11 250.0417 17.50206       03 (Maximum)
12 250.0833 17.50115       03 (Maximum)
13 250.1250 17.50113       03 (Maximum)
14 250.1667 17.50124       03 (Maximum)
15 250.2083 17.50160       03 (Maximum)
16 250.2500 17.50247       03 (Maximum)
17 250.2917 17.50432       03 (Maximum)
19 250.0417 17.50206 04-Model (Maximum)
20 250.0833 17.50115 04-Model (Maximum)
21 250.1250 17.50113 04-Model (Maximum)
22 250.1667 17.50124 04-Model (Maximum)
23 250.2083 17.50160 04-Model (Maximum)
24 250.2500 17.50247 04-Model (Maximum)
25 250.2917 17.50432 04-Model (Maximum)
26 250.3333 17.50558 04-Model (Maximum)
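
The same fill-down idea can also be sketched in base R, without zoo. This is only a minimal sketch and not the answerer's code: it assumes the input starts with a header line and that every header label consists of exactly two whitespace-separated tokens, as in the sample data. The names txt, block and out are illustrative.

```r
# Small in-memory sample; with a real file, use read.table("jdtrial.txt", ...).
txt <- " 2  02-Model (Minimum)
250.04167175293  17.4996566772461
250.08332824707  17.5000038146973
 1  03 (Maximum)
250.04167175293  17.5020561218262"

# fill = TRUE pads the two-field data lines with an empty third field.
DF <- read.table(text = txt, fill = TRUE, as.is = TRUE,
                 col.names = c("V1", "V2", "V3"))

# Header rows are the only rows with a non-empty third field.
is_hdr <- DF$V3 != ""

# cumsum() over the header flags gives, for every row, the index of the
# most recent header (assumption: the first row is a header).
block <- cumsum(is_hdr)

# Rebuild each label from its two tokens (assumption: exactly two tokens).
hdr_name <- paste(DF$V2, DF$V3)[is_hdr]

out <- data.frame(Time  = DF$V1[!is_hdr],
                  value = as.numeric(DF$V2[!is_hdr]),
                  name  = hdr_name[block[!is_hdr]],
                  stringsAsFactors = FALSE)
print(out)
```

Indexing the labels by cumsum(is_hdr) plays the role of na.locf here: each data row picks up the label of the last header seen above it.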

Answer 1: (score: 1)

Here is a function that seems to work with your sample data. It returns a list of data.frames, but you can use do.call(rbind, ...) to get a single data.frame if you prefer.

myFun <- function(textfile) {
  # Read the lines of your text file
  x <- readLines(textfile)
  # Identify lines that start with space followed
  #  by numbers followed by space followed by
  #  numbers. By the looks of it, matching the
  #  space at the start of the line might be
  #  sufficient at this stage.
  myMatch <- grep("^\\s[0-9]+\\s+[0-9]+", x)
  # Extract the first number, which tells us how
  #  many values need to be read in.
  scanVals <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", 
                              "\\1", x[myMatch]))
  # Extract. I've used seq_along which is like 
  #  1:length(myMatch)
  temp <- lapply(seq_along(myMatch), function(y) {
    # scan will return just a single vector, but your
    #  data are in pairs, so we convert the vector to
    #  a matrix filled in by row
    t1 <- matrix(scan(textfile, skip = myMatch[y], 
                      n = scanVals[y]*2), ncol = 2, 
                 byrow = TRUE)
    # Add column names to the matrix
    colnames(t1) <- c("time", "value")
    # Convert the matrix to a data.frame and add the 
    #  name column using cbind.
    cbind(data.frame(t1), 
          name = gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", 
                      x[myMatch])[y])
  })
  # Return the list we just created
  temp
}

Example usage:

myFun("mytest.txt")                  ## list output

do.call(rbind, myFun("mytest.txt"))  ## Single data.frame
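
To see what the regular expressions in myFun pick out, they can be tried on a few lines held in memory. This is an illustrative sketch: the vector x and the names hdr, counts and labels are made up for the demonstration, while the patterns are the ones used in the function above.

```r
# Two header lines and two data lines, in the same layout as the sample file.
x <- c(" 8  02-Model (Minimum)",
       "250.04167175293  17.4996566772461",
       " 7  03 (Maximum)",
       "250.125  17.501127243042")

# Header lines: one whitespace character, digits, whitespace, then digits.
hdr <- grep("^\\s[0-9]+\\s+[0-9]+", x)

# First captured group: the number of data rows that follow the header.
counts <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", "\\1", x[hdr]))

# Second captured group: the rest of the line, used as the series name.
labels <- gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", x[hdr])

print(hdr)
print(counts)
print(labels)

# Caveat: the header pattern expects the label to start with a digit, so a
# hypothetical header such as " 8  Model-05" would not be detected.
print(grepl("^\\s[0-9]+\\s+[0-9]+", " 8  Model-05"))
```

If labels in a real file can start with a letter, the header pattern would need to be loosened accordingly.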

Answer 2: (score: 1)

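Read the data with readLines, then process each chunk in sequence. This avoids having to make assumptions about the model names or fiddle with regular expressions. You have to use a loop rather than [sl]apply, but really, there is nothing wrong with that.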

readFile <- function(file)
{
    con <- readLines(file)
    i <- 1
    chunks <- list()
    while(i < length(con))
    {
        type <- scan(text=con[i], what=character(2), sep="\t")
        nlines <- as.numeric(type[1])
        dat <- cbind(read.delim(text=con[i+seq_len(nlines)], header=FALSE),
                     type=type[2])
        chunks <- c(chunks, list(dat))
        i <- i + nlines + 1
    }
    do.call(rbind, chunks)
}
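
The function above splits the header line on tabs. If the fields are separated by runs of spaces instead, as they appear in the listing in the question, the same chunk-by-chunk loop can be sketched as follows. This is a hedged variant, not the answerer's code: read_chunks and the sample vector are hypothetical, and the count is assumed to be the first token on each header line.

```r
read_chunks <- function(lines) {
  chunks <- list()
  i <- 1
  while (i < length(lines)) {
    hdr <- trimws(lines[i])
    # The first token is the number of data rows in this chunk.
    n <- as.integer(sub("\\s.*$", "", hdr))
    # Everything after the first token is the series name.
    label <- sub("^\\S+\\s+", "", hdr)
    # read.table() splits on any whitespace by default.
    dat <- read.table(text = lines[i + seq_len(n)], header = FALSE,
                      col.names = c("Time", "value"))
    dat$name <- label
    chunks[[length(chunks) + 1]] <- dat
    # Jump to the next header line.
    i <- i + n + 1
  }
  do.call(rbind, chunks)
}

# In-memory sample; for a file, pass readLines("jdtrial.txt") instead.
sample_lines <- c(" 2  02-Model (Minimum)",
                  "250.04167175293  17.4996566772461",
                  "250.08332824707  17.5000038146973",
                  " 1  03 (Maximum)",
                  "250.04167175293  17.5020561218262")
res <- read_chunks(sample_lines)
print(res)
```

Because each header states how many rows follow, the loop never has to guess where a chunk ends.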

Answer 3: (score: 1)

Edited to replace my original answer in light of @G.Grothendieck's better answer. This is largely a variation on that answer.

Another attempt. For demonstration purposes, test is just the raw text, like this:

test <-" 1  02-Model (Minimum)
250.04167175293  17.4996566772461
 1  03 (Maximum)
250.04167175293  17.5020561218262
 1  04-Model (Maximum)
250.04167175293  17.5020561218262"

Processing it:

interm <- read.table(
  text = test, fill = TRUE, as.is = TRUE,
  col.names=c("Time","Value","Name")
)

keys <- which(interm$Name != "")

interm$Name <- rep(
  apply(interm[keys,][-1],1,paste0,collapse=""), 
  diff(c(keys,nrow(interm)+1))
)

result <- interm[-(keys),]

Result:

      Time            Value              Name
2 250.0417 17.4996566772461 02-Model(Minimum)
4 250.0417 17.5020561218262       03(Maximum)
6 250.0417 17.5020561218262 04-Model(Maximum)
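
Note that in the result above the Value column is still character and the two label tokens are joined without a space. A possible adjustment, sketched under the assumption that every label has exactly two tokens, is to paste the tokens with a space and convert Value once the header rows are gone. The name labels is illustrative.

```r
test <- " 1  02-Model (Minimum)
250.04167175293  17.4996566772461
 1  03 (Maximum)
250.04167175293  17.5020561218262"

interm <- read.table(text = test, fill = TRUE, as.is = TRUE,
                     col.names = c("Time", "Value", "Name"))

# Header rows are the ones with a non-empty third field.
keys <- which(interm$Name != "")

# paste() joins the two label tokens with a single space.
labels <- paste(interm$Value[keys], interm$Name[keys])

# Repeat each label over its header row plus the data rows below it.
interm$Name <- rep(labels, diff(c(keys, nrow(interm) + 1)))

# Drop the header rows, then convert the remaining values to numeric.
result <- interm[-keys, ]
result$Value <- as.numeric(result$Value)
print(result)
```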