I am trying to parse a series of csv files. The files have this structure:
"","x"
"1","<HTML>"
"2","<TITLE>Radiosonde Data</TITLE>"
"3","<LINK REL=""StyleSheet"" HREF=""/resources/select.css"" TYPE=""text/css"">"
"4","<BODY BGCOLOR=""white"">"
"5","<H2>15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980</H2>"
"6","<PRE>"
"7","-----------------------------------------------------------------------------"
"8"," PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV"
"9"," hPa m C C % g/kg deg knot K K K "
"10","-----------------------------------------------------------------------------"
"11"," 1000.0 -55 "
"12"," 981.0 91 1.4 1.4 100 4.34 45 26 276.1 288.0 276.8"
"13"," 850.0 1257 3.6 3.6 100 5.86 155 28 289.9 306.9 290.9"
"14"," 700.0 2809 -5.1 -5.1 100 3.76 170 44 296.8 308.2 297.5"
"15"," 500.0 5350 -24.9 -25.7 93 0.95 180 36 302.6 305.8 302.8"
"16"," 400.0 6940 -36.0 -36.9 91 0.41 195 36 308.1 309.6 308.2"
"17"," 300.0 8870 -51.5 -52.8 86 0.10 200 46 312.6 313.0 312.7"
"18"," 250.0 10050 -60.3 -61.7 83 0.04 205 66 316.3 316.4 316.3"
"19"," 238.0 10345 -61.5 -62.9 83 0.03 200 80 319.0 319.1 319.0"
"20"," 200.0 11390 -65.7 -67.3 80 0.02 215 60 328.6 328.7 328.6"
"21"," 150.0 13170 -60.3 -61.9 81 0.06 225 56 366.0 366.3 366.0"
"22"," 100.0 15680 -62.1 -63.8 80 0.07 235 48 407.5 407.9 407.5""
"23","</PRE><H3>Station info</H3><PRE>"
"24"," Station ID: LRBS"
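The goal is to extract the embedded table (the data in rows 11-22 of the example above) as a separate data frame with 11 columns. So far I have written a script that reads all the text lines (using readLines) and then identifies the start and end of each table. But the data are not easy to manipulate this way. I would like to know a more efficient way to extract the tables.
Thank you very much for your help.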
Answer 0 (score: 1)
Given a file named ick.csv in the format shown in the question, read it in as two columns, ignoring all quotes:
txt = read.table("ick.csv",head=FALSE,quote='',sep=',',skip=1, as.is=TRUE)
Now where does the data start? Look for the runs of dashes in the second column:
headlines = grep("------",txt$V2)
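There should be two. The column labels are on the line after the first one, the data start after the second one, and we want everything up to two lines before the end. If your files have other content at the end, you may need to adjust this: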
txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]
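If the amount of trailing content varies between files, one option (an assumption on my part, not something the sample confirms for every file) is to end the table at the first </PRE> after the second dashed rule instead of counting back from the end. A minimal sketch on a made-up stand-in for txt$V2:

```r
# Toy stand-in for txt$V2; these strings are illustrative, not real data.
v2 <- c("<PRE>",
        "----------",
        " PRES HGHT",
        " hPa m",
        "----------",
        " 981.0 91",
        " 850.0 1257",
        "</PRE><H3>Station info</H3><PRE>",
        " Station ID: LRBS",
        "</PRE>")

# Indices of the two dashed rules.
headlines <- grep("------", v2)
# Indices of every line containing a closing </PRE> tag.
closers <- grep("</PRE>", v2, fixed = TRUE)
# The table ends on the line before the first </PRE> after the second rule.
last_row <- min(closers[closers > headlines[2]]) - 1
# Keep the column-label line plus the data lines.
keep <- c(headlines[1] + 1, (headlines[2] + 1):last_row)
v2[keep]
```

Here keep selects the label line and the two data lines, regardless of how many lines follow the table.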
Convert the names to a vector:
names = scan(textConnection(gsub('"',"",txt[1,2])),what="")
Now paste the text into one big text string, and remove the quotes:
txt = paste(txt$V2,collapse="\n")
txt = gsub('"','',txt)
Now treat it as fixed-width data and name the columns (skip=2 skips the label line and the incomplete first data row):
d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names
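The widths=rep(7,11) call relies on every field in the raw file being exactly 7 characters wide. To see how read.fwf splits such lines, here is a small sketch with two made-up rows and three columns (the variable name toy and the values are only for illustration):

```r
# Two made-up rows; each field is right-aligned in exactly 7 characters.
rows <- c("  981.0     91    1.4",
          "  850.0   1257    3.6")

con <- textConnection(rows)
# Cut each line into three 7-character fields and convert them to numbers.
toy <- read.fwf(con, widths = rep(7, 3))
close(con)

names(toy) <- c("PRES", "HGHT", "TEMP")
toy
```

If the columns in your files are not 7 characters wide, the widths vector has to be adjusted to match.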
Ta-da!
> d
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
1 981 91 1.4 1.4 100 4.34 45 26 276.1 288.0 276.8
2 850 1257 3.6 3.6 100 5.86 155 28 289.9 306.9 290.9
3 700 2809 -5.1 -5.1 100 3.76 170 44 296.8 308.2 297.5
4 500 5350 -24.9 -25.7 93 0.95 180 36 302.6 305.8 302.8
5 400 6940 -36.0 -36.9 91 0.41 195 36 308.1 309.6 308.2
6 300 8870 -51.5 -52.8 86 0.10 200 46 312.6 313.0 312.7
7 250 10050 -60.3 -61.7 83 0.04 205 66 316.3 316.4 316.3
8 238 10345 -61.5 -62.9 83 0.03 200 80 319.0 319.1 319.0
9 200 11390 -65.7 -67.3 80 0.02 215 60 328.6 328.7 328.6
10 150 13170 -60.3 -61.9 81 0.06 225 56 366.0 366.3 366.0
11 100 15680 -62.1 -63.8 80 0.07 235 48 407.5 407.9 407.5
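Since the question involves a series of files, the steps can be wrapped in a function. This is only a sketch: extract_sounding is a hypothetical name, it assumes every file has two dashed rules, 7-character-wide columns and a </PRE> line right after the table, and unlike the skip=2 above it keeps every data row, so an incomplete first row would have to be dropped afterwards.

```r
# Hypothetical helper, not part of the original answer.
extract_sounding <- function(path) {
  # Read as two columns, ignoring quotes, and skip the "","x" header line.
  txt <- read.table(path, header = FALSE, quote = "", sep = ",",
                    skip = 1, as.is = TRUE)
  v2 <- gsub('"', "", txt$V2)

  # Locate the two dashed rules and the first </PRE> after the second one.
  headlines <- grep("------", v2)
  closers <- grep("</PRE>", v2, fixed = TRUE)
  last_row <- min(closers[closers > headlines[2]]) - 1

  # Column labels sit on the line after the first rule.
  col_names <- scan(text = v2[headlines[1] + 1], what = "", quiet = TRUE)

  # Data rows run from after the second rule to the line before </PRE>.
  body <- v2[(headlines[2] + 1):last_row]
  con <- textConnection(body)
  on.exit(close(con))

  # Assumes one 7-character field per column label.
  d <- read.fwf(con, widths = rep(7, length(col_names)))
  names(d) <- col_names
  d
}
```

It could then be applied to many files with something like lapply(list.files(pattern = "\\.csv$"), extract_sounding), which returns a list with one data frame per file.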