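I am a new R user. I am currently working on a dataset in which I have to convert multiple binary columns into a single factor column.
Here is an example:
The current dataset looks like this: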
$ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
$ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
$ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
$ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
1 0 0 0
0 1 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
The recoded column should be:
Property
1 Real estate
2 Insurance
3 Real estate
4 Insurance
5 CarOther
6 Unknown
It is basically the reverse of the melt.matrix function.
Thanks, everyone, for your valuable input. It does work. But there is one issue: I have some rows with the values:
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
0 0 0 0
I would like these rows to be marked as NA or Null.
It would help if you could suggest how to do that as well.
Thanks
Answer 0: (score: 2)
> mat <- matrix(c(0,1,0,0,0,
+ 1,0,0,0,0,
+ 0,0,0,1,0,
+ 0,0,1,0,0,
+ 0,0,0,0,1), ncol = 5, byrow = TRUE)
> colnames(mat) <- c("Level1","Level2","Level3","Level4","Level5")
> mat
Level1 Level2 Level3 Level4 Level5
[1,] 0 1 0 0 0
[2,] 1 0 0 0 0
[3,] 0 0 0 1 0
[4,] 0 0 1 0 0
[5,] 0 0 0 0 1
Create a new factor from the index of the 1 in each row, using the matrix column names as the labels for the levels:
NewFactor <- factor(apply(mat, 1, function(x) which(x == 1)),
labels = colnames(mat))
> NewFactor
[1] Level2 Level1 Level4 Level3 Level5
Levels: Level1 Level2 Level3 Level4 Level5
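One caveat for the all-zero rows mentioned in the question: `which(x == 1)` returns an empty vector for such a row, so `apply()` returns a list rather than a plain vector, which `factor()` is not designed to handle. Below is a minimal sketch of one way around that, assuming each row has at most one 1. The matrix `mat0` and the helper `first_one` are hypothetical names, not part of the original answer.

```r
# Hypothetical example: the second row contains no 1 at all.
mat0 <- matrix(c(0,1,0,
                 0,0,0,
                 1,0,0), ncol = 3, byrow = TRUE)
colnames(mat0) <- c("Level1","Level2","Level3")

# Return the position of the first 1 in a row, or NA if there is none.
first_one <- function(x) {
  i <- which(x == 1)
  if (length(i) == 0) NA_integer_ else i[1]
}

# vapply() guarantees exactly one integer per row, so the result is a vector.
idx <- vapply(seq_len(nrow(mat0)), function(r) first_one(mat0[r, ]), integer(1))

# Explicit levels keep the labels aligned even if some level never occurs.
NewFactor0 <- factor(idx, levels = seq_len(ncol(mat0)), labels = colnames(mat0))
NewFactor0
```

Passing `levels` explicitly also matters when a category is absent from the data: without it, `labels` must have the same length as the set of values actually present.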
You can also try:
factor(mat%*%(1:ncol(mat)), labels = colnames(mat))
Also, using Tomas's solution, which I found somewhere on SO:
as.factor(colnames(mat)[mat %*% 1:ncol(mat)])
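The matrix product works because each row holds a single 1: multiplying a row by `1:ncol(mat)` leaves just the position of that 1, while a row of all zeros yields 0. Here is a sketch of how this could also cover the all-zero rows from the question, assuming no row has more than one 1 (`m` and `idx` are illustrative names):

```r
# Illustrative matrix; the second row is all zeros.
m <- matrix(c(1,0,0,0,
              0,0,0,0,
              0,0,1,0), ncol = 4, byrow = TRUE)
colnames(m) <- c("RealEstate","Insurance","CarOther","Unknown")

# Row-wise position of the 1; all-zero rows give 0.
idx <- as.vector(m %*% seq_len(ncol(m)))

# 0 is not among the declared levels, so factor() turns it into NA.
Property <- factor(idx, levels = seq_len(ncol(m)), labels = colnames(m))
Property
```

By contrast, indexing `colnames(m)` directly with a 0 silently drops that element, so the `colnames(mat)[...]` variant would return a vector shorter than the number of rows.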
Answer 1: (score: 2)
Melting is certainly one solution. I suggest using melt from reshape2 as follows:
library(reshape2)
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
#add id column (presumably you have ids more meaningful than row numbers)
df$row=1:nrow(df)
#melt to "long" format
long=melt(df,id="row")
#only keep 1's
long=long[which(long$value==1),]
#merge in ids for NA entries
long=merge(df[,"row",drop=F],long,all.x=T)
#clean up to match example output
long=long[order(long$row),"variable",drop=F]
names(long)="Property"
long$Property=gsub("Property.","",long$Property,fixed=T)
#results
long
Answer 2: (score: 0)
Something different:
Get the data:
dat <- data.frame(Property.RealEstate=c(1,0,1,0,0,0),Property.Insurance=c(0,1,0,1,0,0),Property.CarOther=c(0,0,0,0,1,0),Property.Unknown=c(0,0,0,0,0,1))
Reshape it:
names(dat)[row(t(dat))[t(dat)==1]]
#[1] "Property.RealEstate" "Property.Insurance" "Property.RealEstate"
#[4] "Property.Insurance" "Property.CarOther" "Property.Unknown"
If you want to clean it up, do this:
gsub("Property\\.","",names(dat)[row(t(dat))[t(dat)==1]])
#[1] "RealEstate" "Insurance" "RealEstate" "Insurance" "CarOther" "Unknown"
If you prefer factor output:
factor(row(t(dat))[t(dat)==1],labels=names(dat))
...and cleaned up:
factor(row(t(dat))[t(dat)==1],labels=gsub("Property\\.","",names(dat)) )
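Note that the `row(t(dat))[t(dat)==1]` selection only picks cells equal to 1, so a row of all zeros contributes nothing and the result is shorter than `nrow(dat)`. If the all-zero rows from the question need to appear as NA, one possible sketch uses base R's `max.col()` and `rowSums()` instead. The data frame `dat0` is a made-up example, and the code assumes the columns contain only 0 and 1.

```r
# Made-up data; the second row has no 1 in any column.
dat0 <- data.frame(Property.RealEstate = c(1,0,0),
                   Property.Insurance  = c(0,0,1),
                   Property.CarOther   = c(0,0,0),
                   Property.Unknown    = c(0,0,0))

m <- as.matrix(dat0)
idx <- max.col(m, ties.method = "first")  # column holding the row maximum
idx[rowSums(m) == 0] <- NA                # all-zero rows have no category

Property <- factor(idx, levels = seq_len(ncol(m)),
                   labels = sub("Property.", "", colnames(m), fixed = TRUE))
Property
```

The result has one element per row of `dat0`, so it can be attached back to the data frame as a new column.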
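Answer 3: (score: 0)
Alternatively, you can do it the naive way. I think it is more transparent than any of the other suggestions (including my own other one).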
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
propcols=c("Property.RealEstate", "Property.Insurance", "Property.CarOther", "Property.Unknown")
df$Property=NA
for(colname in propcols){
coldata=df[,colname]
df$Property[which(coldata==1)]=colname
}
df$Property=gsub("Property.","",df$Property,fixed=T)
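The loop leaves `df$Property` as a character column. Since the question asks for a factor, one possible final step is sketched below. It assumes the level order should follow `propcols`, and it rebuilds a smaller `df` so that it runs on its own. Rows that never matched a column stay NA.

```r
# Smaller stand-in for the data above; the first row is all zeros.
df <- data.frame(Property.RealEstate = c(0,0,1),
                 Property.Insurance  = c(0,1,0),
                 Property.CarOther   = c(0,0,0),
                 Property.Unknown    = c(0,0,0))
propcols <- names(df)

# Same naive loop as above: write the column name wherever that column is 1.
df$Property <- NA_character_
for (colname in propcols) {
  df$Property[df[[colname]] == 1] <- colname
}

# Strip the prefix, then convert to a factor with a fixed level order.
lev <- sub("Property.", "", propcols, fixed = TRUE)
df$Property <- factor(sub("Property.", "", df$Property, fixed = TRUE), levels = lev)
df$Property
```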