How do I split one column of a data frame into multiple columns in R?

Time: 2015-06-04 07:19:11

Tags: r

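My data frame has only one variable, but that single column is supposed to hold the data for several columns: ID, year, month and so on are all in one column.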

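This is weather data from the internet. It is very badly organized: it has no column names, and all of the data is pushed into a single column.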

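In each row, characters 1-11 are an ID, the next four characters are the year, the next few characters are the climate measurement for January followed by some flags (blank or a character), then February and its flags (blank or a character), then March and its flags (blank or a character), and so on. A detailed explanation of the data structure is here.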

USH00011084 1974  1628     1606     1363     1039     1343      903     2536      839     2048      358     1118      754   
USH00011084 1975  1714     1837     1544f    2828     1758     1898     4848     2110     2217     1197     1512     1445   
USH00011084 1976   825      541      989      600     2502     1448      971     1157      704      899     1340      856a  
USH00011084 1977  1319      528     2665      473      285     1590     2337     3733      961      434     1259      981   
USH00011084 1978  2722     1023     1574     1214     2919     2136     1548      988      875       46      917     1379   
USH00011084 1979  1927     2671     1285     1063      966     1160     2282     1120      979      292     1470      812   
USH00011084 1980  1639      368     3799     2005     1423     1826. 1   917      423     1449     1353     1039      287   
USH00011084 1981    38b    2846     1170      127     1334      995     2022     1343      467      413      513     1909   
USH00011084 1982  1631     3097      910     1127      879     1416     2103     1482     1060      551      863     1702   
USH00011084 1983  1207     2210     2604     1925      820     1714      662     1235     1204      394     1145     2219   

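Now I am trying to organize the data into a data frame by splitting it into the corresponding columns. I have tried many things, but they all failed.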

require(reshape2)

 colsplit(only_req_col, " ", c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ))

That didn't work!

require(tidyr)

separate(data = data_2, col = V1, into = c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ), sep = "")

That didn't work!

require(reshape2)

data_3 <- colsplit(gsub(pattern = "[0-9]"," ",data_2), 
          names= c("ID", "Year","Jan", "J1", "Feb", "F1", "Mar", "M1", "Apr", "A1", "May", "My1","Jun","Jn1", "Jul", "Jl1", "Aug","A1","Sep", "S1","Oct", "O1","Nov", "N1", "Dec","D1" ))

That didn't work either!

I want to keep only the year and the monthly value for each month, and ignore the flags.

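I eventually got there by calling substr with the position of each month, but it involved a lot of manual counting to get the month positions right: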

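# Each monthly value is 5 characters wide and starts 9 characters after the previous one.
ID   <- substr(DATA_3[, 1], 1, 11)
YEAR <- substr(DATA_3[, 1], 13, 16)
JAN  <- substr(DATA_3[, 1], 18, 22)
FEB  <- substr(DATA_3[, 1], 27, 31)
MAR  <- substr(DATA_3[, 1], 36, 40)
APR  <- substr(DATA_3[, 1], 45, 49)
# Was 53,57, which clips the last digit (see the MAY column in the output below).
MAY  <- substr(DATA_3[, 1], 54, 58)
JUN  <- substr(DATA_3[, 1], 63, 67)
JUL  <- substr(DATA_3[, 1], 72, 76)
AUG  <- substr(DATA_3[, 1], 81, 85)
# Was 90,95, which also picked up the first flag character.
SEP  <- substr(DATA_3[, 1], 90, 94)
OCT  <- substr(DATA_3[, 1], 99, 103)
NOV  <- substr(DATA_3[, 1], 108, 112)
DEC  <- substr(DATA_3[, 1], 117, 121)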


prcp_data <- data.frame(ID, YEAR, JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC)
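The manual counting may be avoidable. In the sample rows above, the month offsets advance by a constant stride of 9 characters (18, 27, 36, ...), so the start positions can be generated with seq() instead of typed by hand. The following is only a sketch under that assumption. The helper `make_line` and the names `raw_lines` and `prcp_sketch` are hypothetical; `make_line` just rebuilds two of the sample rows in the same layout so the example runs on its own.

```r
# Hypothetical helper: rebuild a row in the layout described in the question,
# i.e. 11-char ID, 1 space, 4-char year, then 12 fields of 9 characters
# (1 pad + 5-char right-aligned value + 3 flag characters).
make_line <- function(id, year, values, flags = rep("", 12)) {
  fields <- sprintf(" %5d%-3s", as.integer(values), flags)
  paste0(sprintf("%-11s %4d", id, as.integer(year)),
         paste(fields, collapse = ""))
}

flags_1975 <- rep("", 12)
flags_1975[3] <- "f"  # the 1975 March value carries an "f" flag in the sample

raw_lines <- c(
  make_line("USH00011084", 1974,
            c(1628, 1606, 1363, 1039, 1343, 903, 2536, 839, 2048, 358, 1118, 754)),
  make_line("USH00011084", 1975,
            c(1714, 1837, 1544, 2828, 1758, 1898, 4848, 2110, 2217, 1197, 1512, 1445),
            flags_1975)
)

# Assumption: every monthly value is 5 characters wide and starts 9 characters
# after the previous one, with January starting at position 18.
starts <- seq(from = 18, by = 9, length.out = 12)
ends   <- starts + 4

# substring() is vectorised over the start/stop positions, so one call per row
# returns all 12 monthly values; the flags are never read.
months <- t(vapply(
  raw_lines,
  function(x) as.integer(substring(x, starts, ends)),
  integer(12),
  USE.NAMES = FALSE
))
colnames(months) <- toupper(month.abb)

prcp_sketch <- data.frame(
  ID   = substr(raw_lines, 1, 11),
  YEAR = as.integer(substr(raw_lines, 13, 16)),
  months,
  stringsAsFactors = FALSE
)
```

With the real data, `raw_lines` would be replaced by the single column of the data frame (for example `DATA_3[, 1]`). If a file deviates from the 9-character stride, the generated offsets will be wrong, so it is worth checking a few rows against the format description first.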

Now the output looks fine:

> prcp_data
              ID YEAR   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG    SEP   OCT   NOV   DEC
1    USH00011084 1890 -9999 -9999 -9999 -9999  -999 -9999 -9999 -9999 -9999  -9999 -9999   432
2    USH00011084 1891  1397  1425  1461   419    69  1702  1080   437   508      0  2362  1333
3    USH00011084 1892  3162  1118   650   406    96  1981  3114  3442   762    254   419 -9999
4    USH00011084 1893  1359  1544  1181   965   185   876  1080  1638   876   1613  1237  1237
5    USH00011084 1894   610  4188  2002   572   118   673 -9999 -9999 -9999     76 -9999   191
6    USH00011084 1895 -9999   381  1016  1016   101  1016   762 -9999 -9999    508 -9999   762
7    USH00011084 1896  1118  3404  1499 -9999    81  2794  2375   470   356   1270   864   572
8    USH00011084 1897   622  3124  1207  1105    12    64  1867  2489     0      0 -9999 -9999
9    USH00011084 1900 -9999 -9999  1857  1788    57  3292  1989   993  1552   1646   488  1542
10   USH00011084 1926 -9999  1404  2619   905   127  1723  2149  2950  3477    884   823   900

Is there a better solution that achieves this more easily than what I did?
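One candidate is base R's read.fwf, which reads fixed-width text directly and skips any field whose width is given as a negative number. The sketch below assumes the layout inferred from the sample rows (11-character ID, 1 separator, 4-character year, then twelve groups of 1 pad, 5 value and 3 flag characters). The names `sample_line`, `widths` and `fwf_data` are made up for illustration, and the single input line is rebuilt from the first sample row rather than read from the real file.

```r
# Rebuild the 1974 sample row in the assumed fixed-width layout.
values <- c(1628, 1606, 1363, 1039, 1343, 903, 2536, 839, 2048, 358, 1118, 754)
sample_line <- paste0(
  sprintf("%-11s %4d", "USH00011084", 1974L),
  paste(sprintf(" %5d   ", as.integer(values)), collapse = "")
)

# Positive widths are read as columns; negative widths are skipped.
# Per month: skip 1 pad character, read the 5-character value, skip 3 flags.
widths <- c(11, -1, 4, rep(c(-1, 5, -3), times = 12))

con <- textConnection(sample_line)
fwf_data <- read.fwf(
  con,
  widths = widths,
  col.names = c("ID", "YEAR", toupper(month.abb)),
  stringsAsFactors = FALSE
)
close(con)
```

For the real data, the text connection would be replaced by the path to the downloaded file. The widths are an assumption taken from the sample rows and should be checked against the format description before relying on them.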

0 answers:

No answers yet