Extracting an embedded table

Time: 2016-01-21 19:09:08

Tags: r

I am trying to analyze a series of csv files. The files have this structure.

"","x"
"1","<HTML>"
"2","<TITLE>Radiosonde Data</TITLE>"
"3","<LINK REL=""StyleSheet"" HREF=""/resources/select.css"" TYPE=""text/css"">"
"4","<BODY BGCOLOR=""white"">"
"5","<H2>15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980</H2>"
"6","<PRE>"
"7","-----------------------------------------------------------------------------"
"8","   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV"
"9","    hPa     m      C      C      %    g/kg    deg   knot     K      K      K "
"10","-----------------------------------------------------------------------------"
"11"," 1000.0    -55                                                               "
"12","  981.0     91    1.4    1.4    100   4.34     45     26  276.1  288.0  276.8"
"13","  850.0   1257    3.6    3.6    100   5.86    155     28  289.9  306.9  290.9"
"14","  700.0   2809   -5.1   -5.1    100   3.76    170     44  296.8  308.2  297.5"
"15","  500.0   5350  -24.9  -25.7     93   0.95    180     36  302.6  305.8  302.8"
"16","  400.0   6940  -36.0  -36.9     91   0.41    195     36  308.1  309.6  308.2"
"17","  300.0   8870  -51.5  -52.8     86   0.10    200     46  312.6  313.0  312.7"
"18","  250.0  10050  -60.3  -61.7     83   0.04    205     66  316.3  316.4  316.3"
"19","  238.0  10345  -61.5  -62.9     83   0.03    200     80  319.0  319.1  319.0"
"20","  200.0  11390  -65.7  -67.3     80   0.02    215     60  328.6  328.7  328.6"
"21","  150.0  13170  -60.3  -61.9     81   0.06    225     56  366.0  366.3  366.0"
"22","  100.0  15680  -62.1  -63.8     80   0.07    235     48  407.5  407.9  407.5"
"23","</PRE><H3>Station info</H3><PRE>"
"24","                         Station ID: LRBS"

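The goal is to extract the embedded table (the data in lines 11-22 of the example above) as a separate data frame with 11 columns. So far I have written a script that reads all the text lines (using readLines) and then identifies the start and end of each table. But data manipulation is not easy this way. I would like to know a more efficient way to extract the table.

Thank you very much for your help.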

1 Answer:

Answer 0: (score: 1)

Given a file named ick.csv in that format:

Read it in as two columns, ignoring all quotes:

txt = read.table("ick.csv",header=FALSE,quote='',sep=',',skip=1, as.is=TRUE)
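
A small sketch of what quote='' does here, using a made-up three-line file rather than the real data: the double quotes stay in the strings as literal characters, so the spacing inside each field is untouched until the quotes are stripped later. The temporary file and its contents are assumptions for illustration only.

```r
# Write a tiny stand-in file with the same "index","text" layout.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"","x"',
             '"1","<PRE>"',
             '"2","  981.0     91"'), tmp)

# Same call as above: quotes are not interpreted, the header row is skipped.
txt <- read.table(tmp, header = FALSE, quote = "", sep = ",",
                  skip = 1, as.is = TRUE)

txt$V2                            # each string still carries its quotes
stripped <- gsub('"', "", txt$V2) # leading spaces survive the stripping
stripped
```

Keeping the leading spaces matters because the later fixed-width step relies on every field being exactly 7 characters wide.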

Now where does the data start? Look for a run of dashes in the second column:

headlines = grep("------",txt$V2)

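There should be two. The header labels are on the line after the first one, the data starts after the second, and we want everything up to two lines before the end. If your file has more content at the end, you may need to adjust this: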

txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]
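
The index arithmetic is easier to see on a toy column. This is only a sketch with invented, shortened strings standing in for the second column of the file:

```r
# Toy stand-in for txt$V2 (not the real data).
v2 <- c("<PRE>", "------", "  PRES   HGHT", "   hPa     m ", "------",
        " 1000.0    -55", "  981.0     91", "</PRE>", "Station ID")
txt <- data.frame(V1 = seq_along(v2), V2 = v2, stringsAsFactors = FALSE)

headlines <- grep("------", txt$V2)            # positions of the two dashed rules
keep <- c(headlines[1] + 1,                    # the row of column labels
          (headlines[2] + 1):(nrow(txt) - 2))  # data rows, minus 2 trailing rows
txt[keep, "V2"]
```

The units row between the two rules is left out on purpose, and the "- 2" is the part to change if a file has a different number of trailing lines.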

Convert the names to a vector:

names = scan(textConnection(gsub('"',"",txt[1,2])),what="")

Now paste the text into one big string, and remove the quotes:

txt = paste(txt$V2,collapse="\n")
txt = gsub('"','',txt)

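Now treat it as fixed-width data and name the columns. Note that skip=2 skips the header line and the incomplete first data row (1000.0 hPa):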

d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names
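
The first row of the table in the question (1000.0 hPa) has only two of its fields filled in. A minimal sketch, on three made-up columns, of what read.fwf does with such a row if it is not skipped: blank fixed-width fields should come through as NA. The fld helper is hypothetical and only builds 7-character fields for the example.

```r
# Hypothetical helper: right-justify each value in a 7-character field.
fld <- function(x) paste(formatC(x, width = 7), collapse = "")

body <- c(fld(c("1000.0", "-55", "")),    # partial row, last field blank
          fld(c("981.0", "91", "1.4")))   # complete row

con <- textConnection(body)
d3 <- read.fwf(con, widths = rep(7, 3), col.names = c("PRES", "HGHT", "TEMP"))
close(con)
d3
```

Whether to keep such rows or drop them is a choice for the analysis; the point is only that they do not break the parse.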

Tada!

> d
   PRES  HGHT  TEMP  DWPT RELH MIXR DRCT SKNT  THTA  THTE  THTV
1   981    91   1.4   1.4  100 4.34   45   26 276.1 288.0 276.8
2   850  1257   3.6   3.6  100 5.86  155   28 289.9 306.9 290.9
3   700  2809  -5.1  -5.1  100 3.76  170   44 296.8 308.2 297.5
4   500  5350 -24.9 -25.7   93 0.95  180   36 302.6 305.8 302.8
5   400  6940 -36.0 -36.9   91 0.41  195   36 308.1 309.6 308.2
6   300  8870 -51.5 -52.8   86 0.10  200   46 312.6 313.0 312.7
7   250 10050 -60.3 -61.7   83 0.04  205   66 316.3 316.4 316.3
8   238 10345 -61.5 -62.9   83 0.03  200   80 319.0 319.1 319.0
9   200 11390 -65.7 -67.3   80 0.02  215   60 328.6 328.7 328.6
10  150 13170 -60.3 -61.9   81 0.06  225   56 366.0 366.3 366.0
11  100 15680 -62.1 -63.8   80 0.07  235   48 407.5 407.9 407.5
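
Since the question is about a series of files, the steps can be wrapped in a function. This is a sketch under assumptions, not a tested solution for the real files: read_sounding is a made-up name, the end of the table is assumed to be the line before the first "</PRE>" that follows the second dashed rule (one possible version of the "adjust this" step), and the demo file has only three invented columns.

```r
# Hypothetical helper wrapping the steps above; the name is illustrative.
read_sounding <- function(path, width = 7) {
  txt <- read.table(path, header = FALSE, quote = "", sep = ",",
                    skip = 1, as.is = TRUE)
  v2 <- gsub('"', "", txt$V2)                 # strip the literal quotes

  dashes <- grep("------", v2)                # the two dashed rules
  # Assumption: the table ends just before the first "</PRE>" after the rules.
  pre_end <- grep("</PRE>", v2, fixed = TRUE)
  stopifnot(length(dashes) >= 2, any(pre_end > dashes[2]))
  last <- min(pre_end[pre_end > dashes[2]]) - 1

  col_names <- scan(text = v2[dashes[1] + 1], what = "", quiet = TRUE)
  con <- textConnection(v2[(dashes[2] + 1):last])
  on.exit(close(con))
  read.fwf(con, widths = rep(width, length(col_names)), col.names = col_names)
}

# Build a small demo file in the same layout (invented content).
fld <- function(x) paste(formatC(x, width = 7), collapse = "")
rule <- strrep("-", 21)
content <- c("<HTML>", "<PRE>", rule,
             fld(c("PRES", "HGHT", "TEMP")), fld(c("hPa", "m", "C")), rule,
             fld(c("981.0", "91", "1.4")), fld(c("850.0", "1257", "3.6")),
             "</PRE><H3>Station info</H3><PRE>", "Station ID: LRBS")
tmp <- tempfile(fileext = ".csv")
writeLines(c('"","x"', sprintf('"%d","%s"', seq_along(content), content)), tmp)

tables <- lapply(tmp, read_sounding)   # in practice, a vector of file paths
tables[[1]]
```

Unlike the skip=2 version above, this keeps every row between the second rule and "</PRE>", so a partial first row would appear with NA values. Lines containing extra commas would also need different handling, since read.table expects exactly two fields per line here.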