我需要编写一些代码来处理制表符分隔的文本文件,这些文件定义了R中具有可变长度的段。
如果该组是控件,则每个段中包含的表应在“ Group:”下方开始1行,在“〜End”上方停止2行,如果该组为标准,则在“〜End”上方停止6行。表本身的长度将是可变的,并且可以为空,如段“ SpikedControl”。
示例文件如下:
Group: Controls 1
Sample Wells Values MeanValue CV Od-bkgd-blank
Anti-Hu Det A11 2.849 2.855 0.282 2.853
A23 2.860
Coat Control A12 0.161 0.160 0.530 0.159
A24 0.160
Diluent Standard 1 A9 0.114 0.113 1.379 0.104
A21 0.112
Diluent Standard 2 A8 0.012 0.013 2.817 0.012
A20 0.013
~End
Group: SpikedControl 1
Sample Wells Concentration Values MeanValue CV ODbkgdblank Conc %Expected
~End
Group: Standards 1
Sample ExpConc Wells OD CV OD ODblank MeanODBlank Result %Recovery
St001 2000.000 B1 2.939 1.4 2.932 2.904 Range? Range?
B13 2.882 2.875 1153.779 57.689
St002 666.667 B2 2.820 0.9 2.812 2.829 456.435 68.465
B14 2.855 2.847 670.358 100.554
St003 222.222 B3 2.677 0.9 2.669 2.686 208.849 93.982
B15 2.709 2.702 237.852 107.033
St004 74.074 B4 2.215 1.4 2.205 2.226 72.185 97.449
B16 2.258 2.248 77.452 104.560
St005 24.691 B5 1.406 1.3 1.397 1.410 24.296 98.397
B17 1.433 1.424 25.153 101.868
St006 8.230 B6 0.669 2.9 0.658 0.672 7.781 94.536
B18 0.697 0.686 8.240 100.115
St007 2.743 B7 0.357 5.8 0.348 0.334 3.143 114.579
B19 0.329 0.320 2.759 100.579
St008 0.914 B8 0.198 3.7 0.191 0.186 1.029 112.551
B20 0.188 0.181 0.895 97.891
St009 0.305 B9 0.163 7.8 0.154 0.146 0.532 174.477
B21 0.146 0.137 0.296 97.190
St010 0.102 B10 0.130 5.1 0.123 0.119 0.096 94.087
B22 0.121 0.114 Range? Range?
St011 0.034 B11 0.133 4.7 0.126 0.122 0.134 394.778
B23 0.125 0.117 Range? Range?
St012 0.011 B12 0.117 0.7 0.105 0.104 Range? Range?
B24 0.115 0.104 Range? Range?
EC50 = 28.085
AUC = 5565.432
~End
我对处理像这样的文本文件不是很有经验,并且正在寻找有关如何识别这些段并读取其中的表的一些建议。
谢谢!
编辑-链接到示例文件:
https://www.dropbox.com/s/4m0lmbbequmpd9b/ExampleFile.txt?dl=0
PS:这些文件是从分光光度计吐出的,所以我对该格式没有任何控制权,因为该软件相当陈旧。
编辑2-取得一些进展:
读入文件并获取每个段的开始和结束行
inputtext <- readLines("ExampleFile.txt")
starts <- grep("Group:", inputtext)
ends <- grep("~End", inputtext)
我可以跟进
test2 <- read.table("ExampleFile.txt", header = T, sep = "\t", skip = 17, nrows = 24, blank.lines.skip = F)
现在我只是想弄清楚如何准确识别要读取的行数。
因此,如果开始行是17,结束行是48。则nrows需要到24,即48(指示结束行)-17(跳过的第一行)-7(以解决标题行)和表格末尾的绒毛线,如果是对照表,则也可以是4)
现在,我只需要弄清楚如何循环使用此方法,并正确地确定该组是对照组还是标准组即可减去正确的绒毛量。
答案 0 :(得分:1)
我最终做了以下事情:
library(tidyverse)
inputtext <- readLines("Test.txt")
starts <- grep("Group:", inputtext)
ends <- grep("~End", inputtext)
realend <- (ends - starts) - 2
dff <- list()
for (i in 1:4) {
dff[[i]] <- read.table("Test.txt",
header = T,
sep = "\t",
skip = starts[i],
nrows = realend[i],
blank.lines.skip = F,
row.names = NULL)
}
dff <- lapply(dff, function(x) {x[!is.na(x$Values),]})
dff <- dff[sapply(dff, function(x) dim(x)[1]) > 0]
names(dff) <- letters[1:length(dff)]
list2env(dff,.GlobalEnv)
答案 1 :(得分:0)
您可以为此目的使用嵌套数据框。我通过以下3个步骤详细介绍了该方法。
library(tidyverse)
inputtext <- readLines("~/downloads/ExampleFile.txt")
dtf1 <- data_frame(input = inputtext) %>%
separate(input, c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"), sep = "\t") %>%
mutate(group = ifelse(grepl("Group:", x1), x2, NA)) %>%
fill(group) %>%
filter(!is.na(x4))
head(dtf1)
dtf2 <- dtf1 %>%
group_by(group) %>%
nest()
dtf2$data[[1]]
将列名称从数据传递到第一个嵌套数据框
colnames1 <- dtf2$data[[1]] %>% slice(1) %>% unlist()
colnames1[is.na(colnames1)] <- names(colnames1[is.na(colnames1)])
colnames(dtf2$data[[1]]) <- colnames1
dtf3 <- dtf2 %>%
mutate(names = map(data, slice, 1),
names = map(names, unlist),
names = map(names,
function(x){ # Replace NA column names by the default x_ names
x[is.na(x)] <- names(x[is.na(x)])
return(x)
}),
data = map2(data, names, setNames),
data = map(data, slice, -1))
您现在有了一个数据帧列表。您可以使用组名来调用相应的数据框:
> dtf3$data[dtf3$group=="Controls"][[1]]
# A tibble: 8 x 9
Sample Wells Values MeanValue CV `Od-bkgd-blank` x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Anti-Hu Det A11 2.849 2.855 0.282 2.853 NA NA NA
2 "" A23 2.860 "" "" "" NA NA NA
3 Coat Control A12 0.161 0.160 0.530 0.159 NA NA NA
4 "" A24 0.160 "" "" "" NA NA NA
5 Diluent Standard 1 A9 0.114 0.113 1.379 0.104 NA NA NA
6 "" A21 0.112 "" "" "" NA NA NA
7 Diluent Standard 2 A8 0.012 0.013 2.817 0.012 NA NA NA
8 "" A20 0.013 "" "" "" NA NA NA
> dtf3$data[dtf3$group=="Standards"][[1]]
# A tibble: 24 x 9
Sample ExpConc Wells OD `CV OD` ODblank MeanODBlank Result `%Recovery`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 St001 2000.000 B1 2.939 1.4 2.932 2.904 Range? Range?
2 "" "" B13 2.882 "" 2.875 "" 1153.779 57.689
3 St002 666.667 B2 2.820 0.9 2.812 2.829 456.435 68.465
4 "" "" B14 2.855 "" 2.847 "" 670.358 100.554
5 St003 222.222 B3 2.677 0.9 2.669 2.686 208.849 93.982
6 "" "" B15 2.709 "" 2.702 "" 237.852 107.033
7 St004 74.074 B4 2.215 1.4 2.205 2.226 72.185 97.449
8 "" "" B16 2.258 "" 2.248 "" 77.452 104.560
9 St005 24.691 B5 1.406 1.3 1.397 1.410 24.296 98.397
10 "" "" B17 1.433 "" 1.424 "" 25.153 101.868
# ... with 14 more rows
>
注意:根据this answer重命名。
将源数据作为要点放置,以防万一丢失: spectrophotometer.txt