I need to parse an ASCII file that contains multiple header sections. A model snippet is shown below:
Name1 | header1 | header2 | header3
header1| 11 | x1
Name2 | header1 | header2 | header3
header1| 2.5 | x2
header1| 3.7 | x3
header1| 4.2 | x4
Name3 | header1 | header2 | header3
header1| 34 | x5
header1| 37 | x6
etc.
My task is to compute the variance of the data under header1 for each name:
Names | Variances
-------------------------
Name1 | var(11) # =NA
Name2 | var(c(2.5,3.7,4.2))
Name3 | var(c(34,37))
etc.
How can I handle this kind of file in R?
My real file is more complex:
HD 4478 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |
velocities |V | -23.00 5.20 |D ( )|s , ,O , | | | |1992A&AS...95..541F|
BD +41 43| velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |
velocities |V | 18.40 7.40 |D ( )|s , ,O , | | | |2007AN....328..889K|
velocities |v | 18.4 |D ( 3)| , , , | |NN | |1979IAUS...30...57E|
velocities |v | 15.2 | ( 4)| , , , | | | |1970MmRAS..72..233H|
HIP 8855 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |
velocities |V | -10.00 7.40 |D ( )|s , ,O , | | | |1999A&AS..137..451G|
HD 215441 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |
velocities |v | -5.5 | ( 11)| , , , | | | |1969ApJ...156..967P|
velocities |v | | ( 18)| , , , | |V | |1960ApJ...132..521B|
HD 147010 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |
velocities |V | -3.96 1.41 |B ( )|s , ,O , | | | |2012ApJ...745...56D|
velocities |V | -8.20 3.10 |C ( )|s , ,O , | | | |2006AstL...32..759G|
velocities |v | -9 |C ( 3)| , , , | |NN | |1953GCRV..C......0W|
velocities |v | -8.8 | ( 3)| , , , | | | |1950ApJ...111..221W|
The expected result is:
Names | Variances
-------------------------
HD 4478 | var(-23.00) # =NA
BD +41 43| var(c(18.40,18.4,15.2))
HIP 8855 | var(-10.00) # =NA
HD 215441| var(-5.5) # =NA
HD 147010| var(c(-3.96,-8.20,-9,-8.8))
Answer 0 (score: 1)
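The main problem is reading the data in correctly. Perhaps the format is specified somewhere? In any case, the sample data can be read in a few lines: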
# read your ASCII file from the clipboard (readClipboard() is Windows-only;
# readLines() on a file path is the portable alternative)
asciitxt <- readClipboard()
# find the header lines (those starting with "Name")
headers <- grep("^Name", asciitxt)
# split the text into groups, one per header line
asciitxt <- split(asciitxt, cumsum(seq_along(asciitxt) %in% headers))
# read each group as a data frame
l.in <- lapply(asciitxt, function(x) read.table(text = x, header = TRUE, sep = "|", fill = TRUE, stringsAsFactors = FALSE))
# name the list elements after the first column name of each group
names(l.in) <- sapply(l.in, function(x) names(x)[1])
# compute the variance of header1 within each group
sapply(l.in, function(x) var(x$header1))
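The problem with your real data is that the values needed for the calculation are not separated into a variable of their own. For example, in line 2 the variable "typ" ends up holding not just the value "-23.00" but the whole string "-23.00 5.20". After read.table you will have to dig into the variable "typ" somehow. Take a look at tidyr::extract.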
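As a rough illustration of that last step, here is a minimal base-R sketch for the real data. It does not use tidyr::extract; it swaps in strsplit() and a regular expression so that it runs without extra packages. It rests on several assumptions that are not confirmed by the question: header lines are recognised by the literal text "|typ|", the value is the first number in the third "|"-separated field of each data line, and rows with an empty value are simply dropped. The names is_header and first_number are made up for this example, and the input lines are copied from the question rather than read from a file.

```r
# Sample lines copied from the question; for a real file one would use readLines() instead.
lines <- c(
  "HD 4478 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |",
  "velocities |V | -23.00 5.20 |D ( )|s , ,O , | | | |1992A&AS...95..541F|",
  "BD +41 43| velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |",
  "velocities |V | 18.40 7.40 |D ( )|s , ,O , | | | |2007AN....328..889K|",
  "velocities |v | 18.4 |D ( 3)| , , , | |NN | |1979IAUS...30...57E|",
  "velocities |v | 15.2 | ( 4)| , , , | | | |1970MmRAS..72..233H|",
  "HIP 8855 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |",
  "velocities |V | -10.00 7.40 |D ( )|s , ,O , | | | |1999A&AS..137..451G|",
  "HD 215441 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |",
  "velocities |v | -5.5 | ( 11)| , , , | | | |1969ApJ...156..967P|",
  "velocities |v | | ( 18)| , , , | |V | |1960ApJ...132..521B|",
  "HD 147010 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or| Reference |",
  "velocities |V | -3.96 1.41 |B ( )|s , ,O , | | | |2012ApJ...745...56D|",
  "velocities |V | -8.20 3.10 |C ( )|s , ,O , | | | |2006AstL...32..759G|",
  "velocities |v | -9 |C ( 3)| , , , | |NN | |1953GCRV..C......0W|",
  "velocities |v | -8.8 | ( 3)| , , , | | | |1950ApJ...111..221W|"
)

# Assumption: only header lines contain the literal column label "|typ|".
is_header <- grepl("|typ|", lines, fixed = TRUE)
group <- cumsum(is_header)  # running index of the section each line belongs to

# The object name is everything before the first "|" on a header line.
star <- trimws(sub("\\|.*$", "", lines[is_header]))

# Hypothetical helper: first number found in each string, NA if there is none.
first_number <- function(x) {
  m <- regexpr("[-+]?[0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_real_, length(x))
  ok <- !is.na(m) & m > 0
  out[ok] <- as.numeric(regmatches(x, m))
  out
}

# Assumption: on data lines the third "|"-separated field holds "value m.e.".
data_lines <- lines[!is_header]
data_group <- group[!is_header]
fields <- strsplit(data_lines, "|", fixed = TRUE)
value <- first_number(vapply(fields, function(f) f[3], character(1)))

# Variance per section; missing values are dropped, single values give NA.
variances <- tapply(value, factor(data_group, levels = seq_along(star)),
                    function(v) var(v, na.rm = TRUE))

result <- data.frame(Names = star, Variances = as.vector(variances),
                     stringsAsFactors = FALSE)
print(result)
```

Sections with a single usable value come out as NA, which matches the var(x) # =NA rows in the expected result. If the column layout of the real file differs from the excerpt, the field index (3) and the header test would need adjusting.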