Question

In R, I am trying to import a huge text file with the following structure. Here is a sample, saved as example.txt:

Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green

So far I have been able to extract the names and colors:

file.text <- readLines("example.txt")

curve.names <- trimws(file.text[which(regexpr('Curve Name:', file.text) > 0) + 1])
curve.colors <- trimws(file.text[which(regexpr('Curve Color:', file.text) > 0) + 1])

How can I create a data frame in which curve.name is a factor and the other values are numeric, with the following structure?

curve.name   index   variable.1   variable.2 
   Curve A   0               30          100
   Curve A   1               40           95
   Curve A   2               50           90
   Curve B   0               30          100
   Curve B   1               40           90
   Curve B   2               50           80

Answer 1

Assuming every file has exactly the format shown above:

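txt <- readLines("example.txt")
# Lines 2 and 13 hold the curve names, lines 10 and 21 the colors;
# repeat each one for the three data rows of its curve.
curve_name <- rep(trimws(txt[c(2,13)]), each=3)
curve_color <- rep(trimws(txt[c(10,21)]), each=3)
# Lines 6-8 and 17-19 hold the numeric rows.
val <- read.table(text=paste(txt[c(6:8, 17:19)], collapse = "\n"))
colnames(val) <- c("index", "var1", "var2")
cbind(curve_name, curve_color, val)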

This gives:

  curve_name curve_color index var1 var2
1    Curve A        Blue     0   30  100
2    Curve A        Blue     1   40   95
3    Curve A        Blue     2   50   90
4    Curve B       Green     0   30  100
5    Curve B       Green     1   40   90
6    Curve B       Green     2   50   80

If the format does not exactly match the above, you can try to determine the line indices from the headers, for example by looking at where Curve Values: occurs.
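The idea of locating the line indices from the headers, rather than hard-coding them, might look roughly like the sketch below. It is not part of the original answer. The names name_at, values_at, color_at, first_row and last_row are made up for illustration, the sample lines are inlined so that the sketch runs without example.txt, and it assumes that every "Curve Values:" line is followed by exactly two header lines (column names and units) and at least one data row.

```r
# Sample input, inlined so the sketch runs without example.txt.
txt <- c(
  "Curve Name: ",
  "     Curve A",
  "Curve Values:",
  "     index   Variable 1   Variable 2",
  "                   [°C]          [%]",
  "     0               30          100",
  "     1               40          95",
  "     2               50          90",
  " Curve Color:",
  "     Blue ",
  "",
  "Curve Name: ",
  "     Curve B",
  "Curve Values:",
  "     index   Variable 1   Variable 2",
  "                   [°C]          [%]",
  "     0               30          100",
  "     1               40          90",
  "     2               50          80",
  " Curve Color:",
  "     Green"
)

# Find every header line instead of relying on fixed line numbers.
name_at   <- grep("Curve Name:",   txt, fixed = TRUE)
values_at <- grep("Curve Values:", txt, fixed = TRUE)
color_at  <- grep("Curve Color:",  txt, fixed = TRUE)

# Assumption: two header lines (names, units) follow "Curve Values:",
# and the data rows run up to the line before "Curve Color:".
first_row <- values_at + 3
last_row  <- color_at - 1
n_rows    <- last_row - first_row + 1

# Collect the data line numbers of all records and parse them in one go.
rows <- unlist(Map(seq, first_row, last_row))
val  <- read.table(text = txt[rows], col.names = c("index", "var1", "var2"))

# Repeat each name and color once per data row of its record.
result <- data.frame(
  curve_name  = factor(rep(trimws(txt[name_at + 1]), times = n_rows)),
  curve_color = rep(trimws(txt[color_at + 1]), times = n_rows),
  val
)
result
```

Because the row counts are derived per record, this version should also cope with curves that have different numbers of data rows, as long as the stated assumptions hold.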

Answer 2

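Read the lines into L, removing all spaces before Curve Color. (If the actual file has no spaces before Curve Color, removing them may not be necessary, but the sample in the question does have a space there.) Then re-read the lines that start with a digit to create the variables data.frame. Then read rest using read.dcf, and put the two together with cbind.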

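We have assumed that:

Curve Values is the second field of each record, so that we can drop it from rest with [, -2].
The only lines that start with a digit (after any leading spaces) are the rows of the numeric tables.
Each numeric table has 3 columns, with the column names shown in the question. Its rows start with index 0, and no later row in the same record has index 0. (There is no limit on the number of rows in a numeric table, and different records may have different numbers of rows.)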

No packages are used.

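# Strip the spaces before "Curve Color" so that read.dcf treats it as a field name.
L <- sub("^ *Curve Color", "Curve Color", readLines("example.txt"))
# Lines that start with a digit (after trimming) form the numeric table.
variables <- read.table(text = grep("^\\d", trimws(L), value = TRUE), 
 col.names = c("index", "variable.1", "variable.2"))
# Read the records as DCF fields and drop the second one (Curve Values).
rest <- trimws(read.dcf(textConnection(L))[, -2])
# Each index of 0 starts a new record; repeat its row of rest for every data row.
cbind(rest[cumsum(variables$index == 0), ], variables)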

giving:

  Curve Name Curve Color index variable.1 variable.2
1    Curve A        Blue     0         30        100
2    Curve A        Blue     1         40         95
3    Curve A        Blue     2         50         90
4    Curve B       Green     0         30        100
5    Curve B       Green     1         40         90
6    Curve B       Green     2         50         80
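The row lookup rest[cumsum(variables$index == 0), ] depends on the assumption that the index restarts at 0 in every record. The sketch below illustrates the mechanism on made-up values. The index vector, the meta matrix and the third curve are hypothetical and are not taken from the question.

```r
# Hypothetical index column: three records with 3, 2 and 4 rows.
index <- c(0, 1, 2, 0, 1, 0, 1, 2, 3)

# Every 0 marks the start of a new record, so the running count of
# zeros is the record number of each data row.
record <- cumsum(index == 0)
record

# Hypothetical per-record metadata, one row per record (plays the role of rest).
meta <- cbind("Curve Name"  = c("Curve A", "Curve B", "Curve C"),
              "Curve Color" = c("Blue", "Green", "Red"))

# Indexing by record repeats each metadata row once per data row.
expanded <- meta[record, , drop = FALSE]
expanded
```

The drop = FALSE is a defensive addition in this sketch: it keeps the result a matrix even if there is only one record or one metadata column.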

Answer 3

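This usually takes a lot of grep. Finding a way to group the entries, such as a cumulative sum over the blank lines, can also come in handy: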

l <- readLines(textConnection('Curve Name: 
     Curve A
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          95
     2               50          90
 Curve Color:
     Blue 

Curve Name: 
     Curve B
Curve Values:
     index   Variable 1   Variable 2
                   [°C]          [%]
     0               30          100
     1               40          90
     2               50          80
 Curve Color:
     Green '))

do.call(rbind, 
        lapply(split(trimws(l), cumsum(l == '')), function(x){
            data.frame(
                curve = x[grep('Curve Name:', x) + 1], 
                read.table(text = paste(x[(grep('index', x) + 2):(grep('Curve Color:', x) - 1)], 
                                        collapse = '\n'), 
                           col.names = c('index', 'variable.1', 'variable.2')))}))
##       curve index variable.1 variable.2
## 0.1 Curve A     0         30        100
## 0.2 Curve A     1         40         95
## 0.3 Curve A     2         50         90
## 1.1 Curve B     0         30        100
## 1.2 Curve B     1         40         90
## 1.3 Curve B     2         50         80
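To see what the grouping step does on its own, here is a minimal sketch on a made-up miniature input. The five lines in l and the names group and groups are hypothetical, chosen only to keep the example short.

```r
# Hypothetical miniature input: two records separated by one blank line.
l <- c("Curve Name:", "  Curve A", "", "Curve Name:", "  Curve B")

# The counter goes up by one at each blank line, so all lines of a
# record share the same group id.
group <- cumsum(l == "")
group

# split() then yields one character vector per record, named by group id.
groups <- split(trimws(l), group)
groups
```

The blank separator ends up as the first element of the group that follows it, which is harmless here because the lines inside each group are located by content with grep. The group ids "0" and "1" are also where the row name prefixes in the output above come from.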

In R, how do I create a data frame from a text file with split data?
