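My data frame has only one variable, but it is supposed to hold the data for several columns: ID, year, month, and so on are all packed into a single column.
This is weather data from the internet, and it is very poorly organized. It has no column names, and all the data are pushed into one column.
In each row, characters 1-11 are an ID, the next four characters are the year, and the next few characters are the climate measurement for January, followed by some flags (blank or a character). Then come February and its flags (blank or a character), then March and its flags (blank or a character), and so on. The data structure is explained in detail here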
USH00011084 1974 1628 1606 1363 1039 1343 903 2536 839 2048 358 1118 754
USH00011084 1975 1714 1837 1544f 2828 1758 1898 4848 2110 2217 1197 1512 1445
USH00011084 1976 825 541 989 600 2502 1448 971 1157 704 899 1340 856a
USH00011084 1977 1319 528 2665 473 285 1590 2337 3733 961 434 1259 981
USH00011084 1978 2722 1023 1574 1214 2919 2136 1548 988 875 46 917 1379
USH00011084 1979 1927 2671 1285 1063 966 1160 2282 1120 979 292 1470 812
USH00011084 1980 1639 368 3799 2005 1423 1826. 1 917 423 1449 1353 1039 287
USH00011084 1981 38b 2846 1170 127 1334 995 2022 1343 467 413 513 1909
USH00011084 1982 1631 3097 910 1127 879 1416 2103 1482 1060 551 863 1702
USH00011084 1983 1207 2210 2604 1925 820 1714 662 1235 1204 394 1145 2219
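Now I am trying to organize the data into a data frame by splitting it into the corresponding columns. I have tried many things, but none of them worked.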
require(reshape2)
colsplit(only_req_col, " ", c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ))
This did not work!
require(tidyr)
separate(data = data_2, col = V1, into = c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ), sep = "")
This did not work!
require(reshape2)
data_3 <- colsplit(gsub(pattern = "[0-9]"," ",data_2),
names= c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ))
This did not work either!
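I want to keep only the monthly values for each year and ignore the flags.
I eventually achieved this using substr with the character positions of each month, but it involved a lot of manual counting to get the month positions right: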
ID <- substr(DATA_3[,1], 1,11)
YEAR= substr(DATA_3[,1], 13,16)
JAN = substr(DATA_3[,1], 18,22)
FEB = substr(DATA_3[,1], 27,31)
MAR = substr(DATA_3[,1], 36,40)
APR = substr(DATA_3[,1], 45,49)
MAY = substr(DATA_3[,1], 53,57)
JUN = substr(DATA_3[,1], 63,67)
JUL = substr(DATA_3[,1], 72,76)
AUG = substr(DATA_3[,1], 81,85)
SEP = substr(DATA_3[,1], 90,95)
OCT = substr(DATA_3[,1], 99,103)
NOV = substr(DATA_3[,1], 108,112)
DEC = substr(DATA_3[,1], 117,121)
prcp_data <- data.frame(ID, YEAR, JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC)
The output now looks fine:
> prcp_data
            ID YEAR   JAN   FEB   MAR   APR  MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
1  USH00011084 1890 -9999 -9999 -9999 -9999 -999 -9999 -9999 -9999 -9999 -9999 -9999   432
2  USH00011084 1891  1397  1425  1461   419   69  1702  1080   437   508     0  2362  1333
3  USH00011084 1892  3162  1118   650   406   96  1981  3114  3442   762   254   419 -9999
4  USH00011084 1893  1359  1544  1181   965  185   876  1080  1638   876  1613  1237  1237
5  USH00011084 1894   610  4188  2002   572  118   673 -9999 -9999 -9999    76 -9999   191
6  USH00011084 1895 -9999   381  1016  1016  101  1016   762 -9999 -9999   508 -9999   762
7  USH00011084 1896  1118  3404  1499 -9999   81  2794  2375   470   356  1270   864   572
8  USH00011084 1897   622  3124  1207  1105   12    64  1867  2489     0     0 -9999 -9999
9  USH00011084 1900 -9999 -9999  1857  1788   57  3292  1989   993  1552  1646   488  1542
10 USH00011084 1926 -9999  1404  2619   905  127  1723  2149  2950  3477   884   823   900
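The start positions in the substr calls above are almost all 9 characters apart (18, 27, 36, 45, ...), so they can probably be computed instead of counted by hand. The sketch below assumes that this 9-character stride holds for every month; under that assumption May would sit at 54-58 rather than the 53-57 used above, which may be why the MAY column shows -999 and shorter values. The make_line helper and the two records it builds are hypothetical and only stand in for the real file.

```r
# Hypothetical helper: builds one synthetic record in the ASSUMED layout
# (11-char ID, blank, 4-char year, then 12 x [6-char value + 3 flag chars]).
make_line <- function(id, year, values, flags = rep("", 12)) {
  paste0(sprintf("%-11s %4d", id, as.integer(year)),
         paste0(sprintf("%6d%-3s", as.integer(values), flags), collapse = ""))
}

# Synthetic input for illustration only, not the real data file.
raw <- c(
  make_line("USH00011084", 1974,
            c(1628, 1606, 1363, 1039, 1343, 903, 2536, 839, 2048, 358, 1118, 754)),
  make_line("USH00011084", 1975,
            c(1714, 1837, 1544, 2828, 1758, 1898, 4848, 2110, 2217, 1197, 1512, 1445),
            flags = c("", "", "f", rep("", 9)))
)

# Start positions computed from the assumed stride: 18, 27, ..., 117.
starts <- 18 + 9 * (0:11)

# Extract the 5-character value of each month and convert it to integer;
# the flag characters lie outside these ranges and are dropped.
months <- lapply(starts, function(s) as.integer(substr(raw, s, s + 4)))
names(months) <- toupper(month.abb)

prcp_data <- data.frame(ID   = substr(raw, 1, 11),
                        YEAR = as.integer(substr(raw, 13, 16)),
                        months,
                        stringsAsFactors = FALSE)
```

With the real data, raw would be the single character column (for example DATA_3[, 1] converted with as.character), and the computed positions should be checked against a few known rows before trusting them.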
Is there a better solution that achieves this more easily than mine?
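Since the fields are defined by character position rather than by a separator, a fixed-width reader might be a better fit than colsplit or separate. Below is a minimal sketch using read.fwf from base R's utils package, where negative widths skip a field. The widths are an assumption inferred from the substr positions above (11-character ID, one blank, 4-character year, then twelve blocks of a 6-character value plus 3 flag characters) and should be checked against the format description. The make_line helper and the sample records are hypothetical.

```r
# Hypothetical helper: builds one synthetic record in the ASSUMED layout.
make_line <- function(id, year, values, flags = rep("", 12)) {
  paste0(sprintf("%-11s %4d", id, as.integer(year)),
         paste0(sprintf("%6d%-3s", as.integer(values), flags), collapse = ""))
}

# Synthetic input for illustration only, not the real data file.
raw <- c(
  make_line("USH00011084", 1974,
            c(1628, 1606, 1363, 1039, 1343, 903, 2536, 839, 2048, 358, 1118, 754)),
  make_line("USH00011084", 1975,
            c(1714, 1837, 1544, 2828, 1758, 1898, 4848, 2110, 2217, 1197, 1512, 1445),
            flags = c("", "", "f", rep("", 9)))
)

# Positive widths are read, negative widths are skipped:
# ID (11), blank (skip 1), year (4), then 12 x [value (6), flags (skip 3)].
widths <- c(11, -1, 4, rep(c(6, -3), 12))

prcp <- read.fwf(textConnection(raw),
                 widths = widths,
                 col.names = c("ID", "YEAR", toupper(month.abb)),
                 stringsAsFactors = FALSE)
```

For the real data, the first argument would be the path to the downloaded file instead of textConnection(raw), which lets the file be read directly into 14 columns without going through the one-column data frame first.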