Question

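I have automatically generated files whose format I cannot change. I would like to store the data in an appropriate R structure. They look like this: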

File: /path/to/file


Start Date: 07/05/16
Subject: 0
Start Time: 10:01:09
Name: FooBar
K:       0.000
O:       0.000
A: 
    0:       91.600       65.000      238.000       31.000       24.000
    5:        7.000    22162.000       78.000       10.000    20000.000
   10:       55.000        0.000        2.000        6.000       53.000
B:
    0:        0.000        2.000        1.000        1.000        1.000
    5:        1.000        1.000        1.000        1.000        1.000

[...] # Goes all the way to Z
Start Date: 07/05/16
Subject: 8
Start Time: 10:11:09
Name: JohnDoe
K:       0.000
O:       0.000
A:
    0:       91.600       65.000      238.000       31.000       24.000
[...] # Goes all the way to Z

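I open the file with readLines, so each line is one long character string. Each file contains multiple sessions, each identified by a date, name, subject and time. Each session contains several numeric variables named after the letters of the alphabet (LETTERS). For example, in the first session (FooBar), K can be represented as c(0.000) and B as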

c(0.000,2.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000,1.000)

The first lines (File, Start Date, Start Time, Name) hold the session information, which I was able to save in this data frame:

#Sessions data.frame
structure(list(`Start Date` = c("07/05/16", "07/05/16"), Subject = c("0", "8"), `Start Time` = c("10:01:09", 
"10:11:09"), Name = c("FooBar", 
"JohnDoe"
)), .Names = c("Start Date", "Subject", "Start Time", "name"), row.names = 1:2, class = "data.frame")

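There are two things I am struggling with:

How can I save the variables (A-Z) as numeric vectors?
How can I structure these numeric vectors so that they can be retrieved for each session?

I considered a combination of apply, startsWith and scan, but I could not work out the best way to structure the data.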

Answer 1

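Maybe not exactly what you're after, but this is the best I can come up with. It has one drawback: the vectors grow inside the loop, because we cannot know in advance how long each vector will end up being.
We can't create a data.frame because the A, B, etc. vectors do not all have the same length (they would have to be padded with NA, which doesn't sound useful at all).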
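For readers who do want a rectangular structure, the NA padding mentioned above could look roughly like the sketch below. It is only an illustration: the names `vecs`, `n`, `padded` and `df` are made up here, and the values are copied from the K and B entries of the sample file.

```r
# Two vectors of unequal length (values taken from the sample file)
vecs <- list(
  K = c(0),
  B = c(0, 2, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Length of the longest vector
n <- max(lengths(vecs))

# Assigning a larger length pads the end of a vector with NA
padded <- lapply(vecs, function(x) {
  length(x) <- n
  x
})

# All columns now have the same length, so a data.frame can be built
df <- as.data.frame(padded)
```

The code in this answer keeps the vectors at their natural lengths instead.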

sessions <- list()
sd <- subject <- st <- sname <- cvec <- ""
lines <- readLines("c:/tmp/test.txt")
# Line prefixes to recognise; the trailing " " catches the indented data rows
cases <- c("Start Date:", "Subject:", "Start Time:", "Name:", paste0(LETTERS, ":"), " ")
lnames <- c("SDate", "Subject", "STime", "Name", LETTERS)
for (l in lines) { # loop over the lines
  if (nchar(trimws(l)) == 0) # skip empty lines
    next
  hit <- which(startsWith(l, cases))
  if (length(hit) == 0) # skip unrecognised lines such as "File: ..."
    next
  v <- lnames[min(hit)] # Get the "field" name (NA for a data row)
  fields <- strsplit(trimws(l), " +")[[1]]
  # Here comes the fun: for each case, store the value or update a vector
  if (is.na(v)) { # No field, it's a line of the form "spaces digit: space separated values"
    sessions[[sname]][[cvec]] <-
      c(sessions[[sname]][[cvec]], as.numeric(fields[-1])) # Concatenate with the previous values for this letter
  }
  else if (v == "SDate")
    sd <- fields[3]
  else if (v == "Subject")
    subject <- fields[2]
  else if (v == "STime")
    st <- fields[3]
  else if (v == "Name") {
    sname <- fields[2]
    # Create a new session list entry
    sessions[[sname]] <- list(
      "SDate" = sd,
      "STime" = st,
      "Subject" = as.numeric(subject)
    )
  }
  else if (v %in% LETTERS) { # Switch to a new letter vector, using the value on this line if there is one
    cvec <- v
    sessions[[sname]][[cvec]] <- vector("numeric")
    if (length(fields) > 1)
      sessions[[sname]][[cvec]] <- as.numeric(fields[-1])
  }
}

This creates a list of lists:

> str(sessions)
List of 2
 $ FooBar :List of 7
  ..$ SDate  : chr "07/05/16"
  ..$ STime  : chr "10:01:09"
  ..$ Subject: num 0
  ..$ K      : num 0
  ..$ O      : num 0
  ..$ A      : num [1:15] 91.6 65 238 31 24 ...
  ..$ B      : num [1:10] 0 2 1 1 1 1 1 1 1 1
 $ JohnDoe:List of 7
  ..$ SDate  : chr "07/05/16"
  ..$ STime  : chr "10:11:09"
  ..$ Subject: num 8
  ..$ K      : num 0
  ..$ O      : num 0
  ..$ A      : num [1:15] 91.6 65 238 31 24 ...
  ..$ B      : num [1:10] 0 2 1 1 1 1 1 1 1 1

For the session "FooBar", this gives:

sessions$FooBar
$SDate
[1] "07/05/16"

$STime
[1] "10:01:09"

$Subject
[1] 0

$K
[1] 0

$O
[1] 0

$A
 [1]    91.6    65.0   238.0    31.0    24.0     7.0 22162.0    78.0    10.0 20000.0    55.0     0.0     2.0     6.0    53.0

$B
 [1] 0 2 1 1 1 1 1 1 1 1
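
As a rough illustration of how such a list of lists might be queried per session, here is a small sketch. The `sessions` object below is a hand-built, shortened stand-in for the structure shown above (it is not read from a file), and `all_B` and `meta` are names invented for this example.

```r
# Hand-built stand-in with the same shape as the parsed result (shortened)
sessions <- list(
  FooBar  = list(SDate = "07/05/16", STime = "10:01:09", Subject = 0,
                 K = 0, B = c(0, 2, 1)),
  JohnDoe = list(SDate = "07/05/16", STime = "10:11:09", Subject = 8,
                 K = 0, B = c(1, 1))
)

# One vector for one session
sessions$FooBar$B

# The same letter across all sessions, as a named list
all_B <- lapply(sessions, `[[`, "B")

# Session metadata gathered back into a data.frame
meta <- data.frame(
  Name    = names(sessions),
  SDate   = vapply(sessions, `[[`, character(1), "SDate", USE.NAMES = FALSE),
  STime   = vapply(sessions, `[[`, character(1), "STime", USE.NAMES = FALSE),
  Subject = vapply(sessions, `[[`, numeric(1), "Subject", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Keeping the letter vectors in the list and the metadata in a data.frame avoids any NA padding while still allowing lookup by session name.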

R: storing multiple vectors from an unformatted text file
