我有一个如下所示的数据集Data
:
dput(Data)
structure(list(FN = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "20131202-0985 ", class = "factor"), Values = structure(c(1L,
8L, 7L, 6L, 5L, 9L, 2L, 4L, 3L), .Label = c("|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567",
"|8121|B01|SOMERSET STN", "|96942883", "|SN30|SMRT\n", "CENTRAL",
"FOUR SEASONS HOTEL", "HOTEL", "IKEA", "nanyang avenue"), class = "factor"),
IND = structure(c(4L, 1L, 1L, 1L, 1L, 6L, 3L, 2L, 5L), .Label = c("BN",
"BR", "BS", "LOC", "PN", "RN"), class = "factor")), .Names = c("FN",
"Values", "IND"), class = "data.frame", row.names = c(NA, -9L
))
我希望将上述数据集转换为以下格式的数据框(
out_data
)。
目前我的Data
有3列 - 需要将这些列转换为以下格式的16列。
我需要重新设置我的输入 - 在屏幕截图中确切地给出数据框。
我无法改变以下结构 -
colnames(out_data) <- ("FN","H_BLK","S_N/R_N","B_N","FL_N","U_N","PC","XC","YC","BS","BRF","LCT_DEC","BRN","BO PN","S_TY_CD")
inputnand中的Multiple values列始终采用以下格式:
|639778|21|NANYANG CIRCLE|103.686721631628|1.34640300329567
-
|PC|H_BLK|S_N/R_N|XC|YC
|8121|B01|SOMERSET STN
- &gt; |BS|BRF|LCT_DEC
|SN30|SMRT
------&gt; |BRN|BO
如果
IND =LOC - then |PC|H_BLK|S_N/R_N|XC|YC` get updated with S_TY_CD=LOC
IND= BN - then B_N column should be updated with S_TY_CD=BN
IND= RN - then _N/R_N column should be updated with S_TY_CD=RN
IND= BS then `|BS|BRF|LCT_DEC` should be updated with S_TY_CD=BS
IND= BR then `|BRN|BO` should be updated with S_TY_CD=BR
IND= PN then PN with S_TY_CD=PN
是否有一种有效的方法。
答案 0 :(得分:9)
这是一种转型方法。首先,我为各种子问题定义了一些辅助函数。
#define out cols
outcols<-c("FN", "H_BLK", "S_N/R_N", "B_N", "FL_N", "U_N", "PC",
"XC", "YC", "BS", "BRF", "LCT_DEC", "BRN","BO","PN","S_TY_CD")
#identify parts for each compound value
namevals <- function(ind, vals) {
names<-if (ind=="LOC") {
c("PC","H_BLK","S_N/R_N","XC","YC")
} else if (ind=="BN") {
c("B_N")
} else if (ind=="RN") {
c("S_N/R_N")
} else if (ind=="BS") {
c("BS","BRF","LCT_DEC")
} else if (ind=="BR") {
c("BRN","BO")
} else if (ind=="PN") {
c("PN")
}
stopifnot(length(names)==length(vals))
stopifnot(all(names %in% outcols))
names(vals)<-names
vals
}
#add missing values for row
fillrow <- function(nvals) {
r<-rep(NA, length(outcols))
r[match(names(nvals), outcols)]<-nvals
r
}
现在,我使用mapply
将这些应用于数据的每一行,以返回一个字符向量。在这里,我们确保拆分&#34;值&#34;管道上的柱子并移除引导管道。
#combine rows into character matrix
dt<-mapply(function(fn,vals,ind){
x<-c(FN=fn,namevals(ind, vals), "S_TY_CD"=ind)
fillrow(x)
},
as.character(Data$FN),
strsplit(gsub("^\\|","",as.character(Data$Values)),"|", fixed=T),
as.character(Data$IND)
)
最后我们整理数据,以便将其写入write.table
的文件。请注意,所有缺失值都是真正的R NA
值。在write.table
中,您可以设置na = ""
,如果您打算将其打印为空白值而不是默认值&#34; NA&#34;值。
#turn matrix into data.frame with proper names
dd<-data.frame(unname(t(dt)), stringsAsFactors=F)
names(dd)<-outcols
dd