Question

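I am exporting data from a piece of equipment. Unfortunately, the output is not organized very well, and I have been writing code in R to break it apart. Basically, the data is a separate block of information for each subject (basic descriptive information, plus the raw data for two different measures, A and B, at each time interval), all pasted into one long document. For example: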

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject1
A: 
1: 1  2  4  1 
2: 2  1  2  3
3: 1  0  2  7
B:
1: 2  3  0  1
2: 4  1  1  2 
3: 3  5  2  8

Date: 01/01/2016
Time: 12:00:00 
Subject: Subject2   
A: 
1: 8  2  0  1 
2: 9  1  2  7
3: 1  6  2  7
B:
1: 2  3  2  0
2: 6  7  1  2
3: 3  3  2  4

I have written code in R that works but is not very elegant, using split(seq_along), for-loops and do.call (based largely on this stack overflow question and this blog post).

# First read text file in as a character vector called ‘example’

    scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example

# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector

    regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
    gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2

# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above

    strsplit(example.2, "([A-Z]:)") -> example.3
    split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4

# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)

    my.list <- list()

    for(i in 1:2){
            strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
            strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B

            as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
            as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char

            paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
            paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name

            my.list[[a_name]] <- A_char
            my.list[[b_name]] <- B_char
    }

    final.data <- do.call(rbind, my.list)
    as.data.frame(final.data) -> final.data

    names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")

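With my code I can extract the data for a single time interval of A and B across all subjects (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put all of that information into one data frame. Where it gets messy is when I want to do this for every time interval, not just one. I can't figure out how to do that without running a separate for-loop for each time interval. I tried a for-loop inside a for-loop, but that did not work. I also could not figure out how to use the apply() family of functions for this.

With only 3 time intervals, as in this example, the problem would not be so bad, but my actual data is much longer. Any suggestions for a more elegant and concise approach would be appreciated!

P.S. I am aware that the final data frame produced by the code above has redundant row names. However, they have been a useful way to make sure that the subject and measure information in the final data frame lines up with the labels I applied to the earlier R objects.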

Answer 1

This does everything except the rownames:

lines <- readLines(textConnection("Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2

3: 3 3 2 4
3: 3 3 2 4"))

Some libraries we need for the non-base-R solution:

library(purrr)
library(tibble)
library(tidyr)
library(dplyr)

Trim whitespace and filter out empty lines:

trimws(lines) %>% discard(`==`, "") -> lines
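
For readers without purrr, the same clean-up step can be sketched in base R. The input vector raw below is made up for illustration and is not part of the original data.

```r
# Hypothetical input with trailing spaces, an empty line and a blank-only line.
raw <- c("Date: 01/01/2016 ", "", "  A:", "   ")

# Strip leading/trailing whitespace from every line.
trimmed <- trimws(raw)

# Keep only the lines that still contain something.
cleaned <- trimmed[trimmed != ""]
cleaned
```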

This builds a vector of the indexes in lines where each record starts (identified by finding Date: at the beginning of a line):

starts <- which(grepl("^Date:", lines))

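Now we take those starts and look for the next occurrence of Date: (i.e. the next record). The search finds all of them, so we only care about the first one. Because the search runs on the lines after the current start, we turn the match position into an index in lines by adding the start index, then subtract 1 to land on the last line of the current record. In theory there should be only one NA (for the last record), but we are lazy and use ifelse rather than just changing the last one.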

ends <- map_dbl(starts, function(i) {
  which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1
})
ends <- ifelse(is.na(ends), length(lines), ends)
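
As a cross-check on the index arithmetic, here is a minimal base R sketch on a toy vector (the record contents are hypothetical). It relies on the observation that each record ends one line before the next one starts, which is an alternative to the per-start search above and not the answer's own method.

```r
# Toy input: three records of different lengths (hypothetical data).
lines <- c("Date: d1", "x", "y",
           "Date: d2", "x",
           "Date: d3", "x", "y", "z")

# Indexes where each record begins.
starts <- which(grepl("^Date:", lines))

# Each record ends one line before the next record starts;
# the last record ends at the final line.
ends <- c(starts[-1] - 1, length(lines))

starts  # 1 4 6
ends    # 3 5 9
```

Both approaches should give the same starts and ends as long as every record begins with a Date: line.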

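So now starts holds the index of the first line of each record and ends holds the index of the last line of each record.

map2_df() is a very handy pseudo-wrapper for mapply() & do.call(rbind, …). We take advantage of the fact that these lines are in DCF format (key: value) and use read.dcf(). That produces a matrix, which we then transpose and turn into a data.frame.

We then separate the values, promote the row names to a time_interval column, add in the date, time and subject, and make sure the columns have the right types.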
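
To make the read.dcf() step concrete, here is a small base R sketch of what it returns for the header lines and for one block of data lines, using values from the question's first record. The variable names (header, m, tm, vals) are chosen for illustration only.

```r
# Header lines are plain "key: value" pairs, which read.dcf() parses directly.
header <- read.dcf(textConnection(c("Date: 01/01/2016",
                                    "Time: 12:00:00",
                                    "Subject: Subject1")))
colnames(header)      # "Date" "Time" "Subject"

# The data lines have the same shape: the interval number acts as the key.
m <- read.dcf(textConnection(c("1: 1 2 4 1", "2: 2 1 2 3", "3: 1 0 2 7")))
dim(m)                # one row, three fields

# Transpose so that each time interval becomes a row.
tm <- t(m)

# Split each space-separated string into four numeric values.
vals <- do.call(rbind, lapply(strsplit(tm[, 1], " +"), as.numeric))
vals
```

The row names of vals carry the interval numbers, which is what the answer later promotes to a time_interval column.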

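We also use the fact that map_df() will turn the "keys" of a named list into a column if we tell it to.

Finally, we reorder the columns.

So, this iterates over starts and ends, passing each pair in as start and end: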

map2_df(starts, ends, function(start, end) {

  # now, we extract just the current record into `record` by pulling
  # out lines by the indexes.

  record <- lines[start:end]

  # we then use `read.dcf` to read in the date/subject/time values:

  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  # Since we do not have blank lines and you said the records were
  # uniform we can use the fact that they'll be at known index
  # positions in this `record`. So, we make a list of two vectors
  # which are the indexes. Each becomes `i` (two total iterations)
  # and we use the value in `i` to extract out the three lines from
  # `record` and read those via `read.dcf`.

  # But that reads things into a matrix and in an unhelpful order
  # so we transpose it into shape and make it a data frame since
  # we'll ultimately need that.

  # We use `separate` to take the single character space-separated
  # `V1` column and turn it into 4 columns. `read.dcf` gave us
  # named rows for each time interval so we promote that to a 
  # full-on column and then add in date/time/subject, ensuring
  # they are characters and not factors, then ensure that the 
  # values we split out from `V1` are numeric and not character or
  # factor.

  # `map_df` can add in the `A` and `B` from the named list we passed
  # in for us and we have it call that column `measure`.

  # finally, we put the columns in a better order.

  map_df(list(A=5:7, B=9:11), function(i) {
    read.dcf(textConnection(record[i])) %>%  
      t() %>% as_data_frame() %>%
      separate(V1, sprintf("V%d", 1:4)) %>%
      rownames_to_column("time_interval") %>%
      mutate(date=as.character(header$Date),
             time=as.character(header$Time),
             subject=as.character(header$Subject)) %>%
      mutate_at(vars(starts_with("V")), as.numeric)

  }, .id="measure")

}) %>% 
  select(date, time, subject, measure, time_interval, V1, V2, V3, V4)

Which produces the following output:

## # A tibble: 18 x 9
##          date     time  subject measure time_interval    V1    V2    V3    V4
##         <chr>    <chr>    <chr>   <chr>         <chr> <dbl> <dbl> <dbl> <dbl>
## 1  01/01/2016 12:00:00 Subject1       A             1     1     2     4     1
## 2  01/01/2016 12:00:00 Subject1       A             2     2     1     2     3
## 3  01/01/2016 12:00:00 Subject1       A             3     1     0     2     7
## 4  01/01/2016 12:00:00 Subject1       B             1     2     3     0     1
## 5  01/01/2016 12:00:00 Subject1       B             2     4     1     1     2
## 6  01/01/2016 12:00:00 Subject1       B             3     3     5     2     8
## 7  01/01/2016 12:00:00        2       A             1     8     2     0     1
## 8  01/01/2016 12:00:00        2       A             2     9     1     2     7
## 9  01/01/2016 12:00:00        2       A             3     1     6     2     7
## 10 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 11 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 12 01/01/2016 12:00:00        2       B             3     3     3     2     4
## 13 01/01/2016 12:00:00        2       A             1     8     2     0     1
## 14 01/01/2016 12:00:00        2       A             2     9     1     2     7
## 15 01/01/2016 12:00:00        2       A             3     1     6     2     7
## 16 01/01/2016 12:00:00        2       B             1     2     3     2     0
## 17 01/01/2016 12:00:00        2       B             2     6     7     1     2
## 18 01/01/2016 12:00:00        2       B             3     3     3     2     4

If you really need a base R solution, then:

do.call(rbind, mapply(function(start, end) {

  # pull out the current record and read its three header lines
  record <- lines[start:end]
  header <- as.data.frame(read.dcf(textConnection(record[1:3])))

  # lines 5:7 of a record hold measure A, lines 9:11 hold measure B
  do.call(rbind, lapply(list(A=5:7, B=9:11), function(i) {
    # one field per time interval; transpose so intervals become rows
    vals <- t(read.dcf(textConnection(record[i])))
    # split each space-separated string into four numeric columns
    mat <- matrix(as.numeric(unlist(strsplit(vals[, 1], " +"))), ncol=4, byrow=TRUE)
    mat <- as.data.frame(mat)
    mat$time_interval <- 1:3
    mat$date <- as.character(header$Date)
    mat$time <- as.character(header$Time)
    mat$subject <- as.character(header$Subject)
    mat
  })) -> df

  # rbind() names the rows A.1, A.2, ..., B.3; keep the part before the dot
  df$measure <- gsub("\\..*$", "", rownames(df))
  rownames(df) <- NULL
  df

}, starts, ends, SIMPLIFY=FALSE)) -> out_df
out_df[,c("date", "time", "subject", "measure", "time_interval", "V1", "V2", "V3", "V4")]

##          date     time  subject measure time_interval V1 V2 V3 V4
## 1  01/01/2016 12:00:00 Subject1       A             1  1  2  4  1
## 2  01/01/2016 12:00:00 Subject1       A             2  2  1  2  3
## 3  01/01/2016 12:00:00 Subject1       A             3  1  0  2  7
## 4  01/01/2016 12:00:00 Subject1       B             1  2  3  0  1
## 5  01/01/2016 12:00:00 Subject1       B             2  4  1  1  2
## 6  01/01/2016 12:00:00 Subject1       B             3  3  5  2  8
## 7  01/01/2016 12:00:00        2       A             1  8  2  0  1
## 8  01/01/2016 12:00:00        2       A             2  9  1  2  7
## 9  01/01/2016 12:00:00        2       A             3  1  6  2  7
## 10 01/01/2016 12:00:00        2       B             1  2  3  2  0
## 11 01/01/2016 12:00:00        2       B             2  6  7  1  2
## 12 01/01/2016 12:00:00        2       B             3  3  3  2  4
## 13 01/01/2016 12:00:00        2       A             1  8  2  0  1
## 14 01/01/2016 12:00:00        2       A             2  9  1  2  7
## 15 01/01/2016 12:00:00        2       A             3  1  6  2  7
## 16 01/01/2016 12:00:00        2       B             1  2  3  2  0
## 17 01/01/2016 12:00:00        2       B             2  6  7  1  2
## 18 01/01/2016 12:00:00        2       B             3  3  3  2  4

Answer 2

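It is not clear that a data frame is the most convenient way to represent this data. The following shows three alternative outputs:

three arrays: a matrix ident with one row per subject and date, time and subject columns; an array A such that A[,,i] is the A matrix of the i-th subject; and an array B such that B[,,i] is the B matrix of the i-th subject
a wide form data frame
a long form data frame

No packages are used.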

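For all three, read the file into a character vector Lines. Then use grep to remove any empty lines; if we knew there were no empty lines we could omit this step. Then split Lines into subject groups s. Then lapply over the subject groups and, within each one, grab the date, time and subject from the first three lines and the two matrices from lines 5:7 and 9:11, producing a list L with one component per subject. The key code, which makes it easy to rework the data into different formats, is this:

Lines <- readLines("example_file.txt")
Lines <- grep("^\\s*$", Lines, value = TRUE, invert = TRUE)
s <- split(Lines, cumsum(grepl("^Date:", Lines)))
L <- lapply(s, function(x) list(read.dcf(textConnection(x[1:3])),
  A = as.matrix(read.table(text = sub(":", "", x[5:7]), row.names = 1)),
  B = as.matrix(read.table(text = sub(":", "", x[9:11]), row.names = 1))))
names(L) <- sapply(L, function(x) x[[1]][, "Subject"])

Given L, we can readily create a variety of output formats. Each of the three formats is shown in a separate section below. The output is shown at the end so as not to break up the code.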
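
The grouping idiom used here, split() on cumsum(grepl(...)), may be easier to follow on a toy input. The sketch below uses a shortened, made-up Lines vector; the names is_start and group are illustrative only.

```r
# Shortened, hypothetical input: two records of three lines each.
Lines <- c("Date: 01/01/2016", "Subject: Subject1", "A:",
           "Date: 01/01/2016", "Subject: 2", "A:")

# TRUE on every line that opens a new record.
is_start <- grepl("^Date:", Lines)

# The running count of record starts gives each line its record number.
group <- cumsum(is_start)   # 1 1 1 2 2 2

# One list component per record, named "1", "2", ...
s <- split(Lines, group)
lengths(s)
```

Unlike fixed-size chunking, this grouping does not need every record to have the same number of lines, although the later indexing by 5:7 and 9:11 still assumes a uniform layout.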

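Three arrays

We could use L as is, but it may be more convenient to convert L into three arrays: (1) ident, a 3-column matrix with one row per subject giving the date, time and subject of each, (2) A, a 3d array in which A[,,i] is the A matrix of the i-th subject, and (3) B, a 3d array in which B[,,i] is the B matrix of the i-th subject.

ident <- do.call(rbind, lapply(L, "[[", 1))
A <- simplify2array(lapply(L, "[[", 2))
B <- simplify2array(lapply(L, "[[", 3))

data.frame - wide form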

DF <- do.call(rbind, lapply(L, function(x) data.frame(x[[1]], x[[2]], x[[3]])))
names(DF)[4:7] <- "A"
names(DF)[8:11] <- "B"
rownames(DF) <- NULL

data.frame - long form

DF2 <- do.call(rbind, lapply(L, function(x)
          data.frame(x[[1]], rbind(cbind(AB = "A", x[[2]]), cbind(AB = "B", x[[3]])))))
rownames(DF2) <- NULL
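
One thing to be aware of with the long form: cbind() of a character value and a numeric matrix yields a character matrix, so the measurement columns arrive as text. The following is a minimal sketch of converting them back, using a small made-up matrix A1 in place of one subject's data; it is one possible follow-up step, not part of the original answer.

```r
# Hypothetical stand-in for one subject's A matrix (3 intervals, 2 columns).
A1 <- matrix(c(1, 2, 1, 2, 1, 0), nrow = 3,
             dimnames = list(as.character(1:3), c("V2", "V3")))

# Binding a label column onto a numeric matrix coerces everything to character.
long <- data.frame(cbind(AB = "A", A1), stringsAsFactors = FALSE)
sapply(long, class)

# Convert every column except the label back to numeric.
num_cols <- setdiff(names(long), "AB")
long[num_cols] <- lapply(long[num_cols], as.numeric)
sapply(long, class)
```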

Output - three arrays

> ident
     Date         Time       Subject   
[1,] "01/01/2016" "12:00:00" "Subject1"
[2,] "01/01/2016" "12:00:00" "2"       
> A
, , Subject1

  V2 V3 V4 V5
1  1  2  4  1
2  2  1  2  3
3  1  0  2  7

, , 2

  V2 V3 V4 V5
1  8  2  0  1
2  9  1  2  7
3  1  6  2  7

> B
, , Subject1

  V2 V3 V4 V5
1  2  3  0  1
2  4  1  1  2
3  3  5  2  8

, , 2

  V2 V3 V4 V5
1  2  3  2  0
2  6  7  1  2
3  3  3  2  4
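
As a rough illustration of why the array layout can be convenient, the sketch below rebuilds an array with the same shape and values as A above by hand (so it runs on its own) and then indexes and summarizes it. The object tot is an illustrative name.

```r
# Hand-built copy of the A array shown above: intervals x columns x subjects.
A <- array(c(1, 2, 1,  2, 1, 0,  4, 2, 2,  1, 3, 7,
             8, 9, 1,  2, 1, 6,  0, 2, 2,  1, 7, 7),
           dim = c(3, 4, 2),
           dimnames = list(as.character(1:3), paste0("V", 2:5),
                           c("Subject1", "2")))

# The full matrix of one subject, selected by name.
A[, , "Subject1"]

# Time interval 2 for every subject (columns x subjects).
A["2", , ]

# Sum over the four columns, giving an intervals x subjects table.
tot <- apply(A, c(1, 3), sum)
tot
```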
