Reshaping a dataset into a different format in R

Time: 2014-05-30 04:17:05

Tags: r reshape

I have a dataset, Data, shown below.

dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L, 
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567", 
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL", 
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"), 
    IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN", 
    "BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN", 
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))

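I would like to convert the above dataset into a data frame (out_data) with the format below. Currently my Data has 3 columns, and these need to be converted into the 16 columns listed below. I need to reshape my input to match exactly the data frame shown in the screenshot. I cannot change the following structure: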

colnames(out_data) <- c("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO","PN","S_TY_CD")


The multi-value entries in the Values column of the input always take one of the following formats:

  • |639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567 -> |PC|H_BLK|S_N/R_N|XC|YC
  • |8121|B01|SOMERSET STN -> |BS|BRF|LCT_DEC
  • |SN30|SMRT -> |BRN|BO

The rules are:

IND = LOC: then |PC|H_BLK|S_N/R_N|XC|YC should be updated, with S_TY_CD = LOC
IND = BN: then the B_N column should be updated, with S_TY_CD = BN
IND = RN: then the S_N/R_N column should be updated, with S_TY_CD = RN
IND = BS: then |BS|BRF|LCT_DEC should be updated, with S_TY_CD = BS
IND = BR: then |BRN|BO should be updated, with S_TY_CD = BR
IND = PN: then the PN column should be updated, with S_TY_CD = PN

Is there an efficient way to do this?

1 Answer:

Answer 0 (score: 9)

Here is one approach to the transformation. First, I define a few helper functions for the various sub-problems.

#define  out cols
outcols<-c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC", 
    "XC", "YC", "BS", "BRF", "LCT_DEC", "BRN","BO","PN","S_TY_CD")

#identify parts for each compound value
namevals <- function(ind, vals) {
    names<-if (ind=="LOC") {
        c("PC","H_BLK","S_N/R_N","XC","YC")
    } else if (ind=="BN") {
        c("B_N")
    } else if (ind=="RN") {
        c("S_N/R_N")
    } else if (ind=="BS") {
        c("BS","BRF","LCT_DEC")
    } else if (ind=="BR") {
        c("BRN","BO")
    } else if (ind=="PN") {
        c("PN")
    }
    stopifnot(length(names)==length(vals))
    stopifnot(all(names %in% outcols))
    names(vals)<-names
    vals
}

#add missing values for row
fillrow <- function(nvals) {
    r<-rep(NA, length(outcols))
    r[match(names(nvals), outcols)]<-nvals
    r
}
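As a side note, the if/else chain in namevals could also be expressed as a lookup table. The following is only a minimal sketch under the same IND-to-column mapping as above; the names ind_cols and namevals_lookup are hypothetical and do not appear in the original answer.

```r
# Hypothetical lookup table: IND code -> target column names
# (same mapping as the if/else chain in namevals)
ind_cols <- list(
  LOC = c("PC", "H_BLK", "S_N/R_N", "XC", "YC"),
  BN  = "B_N",
  RN  = "S_N/R_N",
  BS  = c("BS", "BRF", "LCT_DEC"),
  BR  = c("BRN", "BO"),
  PN  = "PN"
)

namevals_lookup <- function(ind, vals) {
  cols <- ind_cols[[ind]]
  # Fail loudly on an unknown IND code or a mismatched number of values
  stopifnot(!is.null(cols), length(cols) == length(vals))
  setNames(vals, cols)
}

# Example: a "BS" row whose value was already split on the pipe
res <- namevals_lookup("BS", c("8121", "B01", "SOMERSET STN"))
print(res)
```

Keeping the mapping in one list makes it easier to add a new IND code later without touching the function body.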

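Now I use mapply to apply these to each row of the data, returning one character vector per row. Here we make sure to split the "Values" column on the pipe character and to remove the leading pipe.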

#combine rows into character matrix
dt<-mapply(function(fn,vals,ind){   
    x<-c(FN=fn,namevals(ind, vals), "S_TY_CD"=ind)
    fillrow(x)
  }, 
  as.character(Data$FN), 
  strsplit(gsub("^\\|","",as.character(Data$Values)),"|", fixed=TRUE), 
  as.character(Data$IND)
)
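To make the splitting step concrete, here is a small sketch that runs the same gsub/strsplit combination on a few values copied from the question. One thing it illustrates: the sample value "|SN30|SMRT\n" carries a trailing newline that survives the split. Whether that newline should be removed is an assumption on my part; the trimws call below is one possible way to do it and is not part of the original answer.

```r
# Sample values copied from the question's Data$Values
vals <- c("|8121|B01|SOMERSET STN", "|SN30|SMRT\n", "IKEA")

# Drop the leading pipe, then split on the literal pipe character
parts <- strsplit(gsub("^\\|", "", vals), "|", fixed = TRUE)
print(parts[[1]])

# Assumption: the trailing newline is unwanted; trimws() strips
# leading and trailing whitespace from every piece
parts_trimmed <- lapply(parts, trimws)
print(parts_trimmed[[2]])
```

Values without a pipe, such as "IKEA", pass through as a single-element vector, which is why the one-column IND codes (BN, RN, PN) work with the same code path.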

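Finally, we tidy up the data so that it can be written to a file with write.table. Note that all missing values are true R NA values. In write.table you can set na = "" if you want them written as blanks rather than the default "NA".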

#turn matrix into data.frame with proper names
dd<-data.frame(unname(t(dt)), stringsAsFactors=FALSE)
names(dd)<-outcols
dd
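The na = "" option mentioned above can be sketched as follows. The data frame demo is a tiny hypothetical stand-in for dd (three columns instead of sixteen), and the output path is a temporary file chosen only for illustration.

```r
# Tiny hypothetical stand-in for dd; the real data frame has 16 columns
demo <- data.frame(FN = "20131202-0985", B_N = NA, PN = "96942883",
                   stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")  # illustrative output path

# na = "" writes missing values as empty fields instead of "NA"
write.table(demo, out_file, sep = ",", na = "",
            row.names = FALSE, quote = FALSE)

lines <- readLines(out_file)
print(lines)
```

The separator and quoting settings here are illustrative choices; only na = "" comes from the answer itself.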