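Edit: This is an attempt to convey my problem better and to provide a reproducible example. Could someone please (1) explain what is wrong with my approach and (2) suggest any reasonable solution?
My data looks like this: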
DF.PSoft <- structure(list(Last = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("Carruthers",
"Fester", "Mauger", "Schofield", "Vanhoy"), class = "factor"),
Salary = structure(c(5L, 3L, 1L, 2L, 4L), .Label = c("121991.0",
"142403.0", "47305.0", "47740.0", "49172.0"), class = "factor"),
Dept1 = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("215086",
"221230"), class = "factor"), Distrib1 = structure(c(2L,
1L, 4L, 1L, 3L), .Label = c("100.0", "50.0", "75.0", "90.0"
), class = "factor"), Dept2 = structure(c(2L, 1L, 3L, 1L,
4L), .Label = c("", "026112", "215086", "221704"), class = "factor"),
Distrib2 = structure(c(3L, 1L, 4L, 1L, 2L), .Label = c("0.0",
"15.0", "40.0", "5.0"), class = "factor"), Dept3 = structure(c(3L,
1L, 3L, 1L, 2L), .Label = c("", "215086", "221704"), class = "factor"),
Distrib3 = structure(c(2L, 1L, 3L, 1L, 2L), .Label = c("0.0",
"10.0", "5.0"), class = "factor")), .Names = c("Last", "Salary",
"Dept1", "Distrib1", "Dept2", "Distrib2", "Dept3", "Distrib3"
), row.names = c(NA, -5L), class = "data.frame")
> DF.PSoft
#         Last   Salary  Dept1 Distrib1  Dept2 Distrib2  Dept3 Distrib3
# 1  Schofield  49172.0 221230     50.0 026112     40.0 221704     10.0
# 2     Mauger  47305.0 215086    100.0              0.0              0.0
# 3     Fester 121991.0 221230     90.0 215086      5.0 221704      5.0
# 4 Carruthers 142403.0 215086    100.0              0.0              0.0
# 5     Vanhoy  47740.0 221230     75.0 221704     15.0 215086     10.0
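The data describe people working in various departments. Schofield spends 50% of the time in department 221230, 40% in 026112, and 10% in 221704. The real data set has 10 Dept and Distrib columns in total; in this example I am using 3 Dept/Distrib columns.
I want to reshape the data into a new data frame showing Last, Salary, the "Dept" column that contains 215086 (if any "Dept" column matches), and the corresponding "Distrib" column: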
> DF.Desired
#         Last   Salary   Dept Distrib
# 1     Mauger  47305.0 215086   100.0
# 2     Fester 121991.0 215086     5.0
# 3 Carruthers 142403.0 215086   100.0
# 4     Vanhoy  47740.0 215086    10.0
How can I do this? I have been struggling with this problem. Here is what I have so far.
Find all indices where the data == "215086".
test <- which(DF.PSoft=="215086", arr.ind=TRUE)
>test
row col
# [1,] 2 3
# [2,] 4 3
# [3,] 3 5
# [4,] 5 7
Create an empty data frame to hold the data I extract in the next step.
DF.blank <- data.frame(Last=character(dim(test)[1]), Salary=character(dim(test)[1]), Dept=character(dim(test)[1]), Distrib=character(dim(test)[1]), stringsAsFactors=FALSE)
>DF.blank
Last Salary Dept Distrib
# 1
# 2
# 3
# 4
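Fill the empty data frame with the data I want. Use the indices from 'test' to subset the full data set (DF.PSoft), that is, every entry == "215086". For each DF.PSoft row number listed in 'test', take DF.PSoft columns 1 and 2, whichever column contains "215086" (this pulls "215086" from the matching "Dept" column of DF.PSoft), and the column immediately to the right of that match (this pulls the corresponding Distrib). If I am thinking about this correctly, the method will work no matter how many "Dept" or "Distrib" columns the file has. I want to keep that ability.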
for(i in 1:dim(test)[1]){
DF.blank[i,] <- DF.PSoft[test[i,1], c(1,2, test[i,2], test[i,2]+1)]
}
Infuriatingly, I get this result:
>DF.blank
Last Salary Dept Distrib
# 1 3 3 1 1
# 2 1 2 1 1
# 3 2 1 3 4
# 4 5 4 2 2
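One plausible reading of this output: every column of DF.PSoft is a factor, and a factor stores integer codes plus a table of labels. Assigning a factor value into a character vector appears to copy the code rather than the label, which would explain why "Mauger" shows up as 3 (its position among the sorted levels of Last). The sketch below shows the effect on a small, made-up factor `f`; the names `f`, `x`, and `y` are illustrative only.

```r
# A factor stores integer codes plus a "levels" attribute.
f <- factor(c("Schofield", "Mauger", "Fester"))
levels(f)        # "Fester" "Mauger" "Schofield"
as.integer(f)    # 3 2 1

# Assigning a factor element into a character vector keeps only the code.
x <- character(1)
x[1] <- f[2]
x                # "2", not "Mauger"

# Converting to character first keeps the label.
y <- character(1)
y[1] <- as.character(f[2])
y                # "Mauger"
```

If that is what is happening here, converting the factor columns to character before the loop should make the assignment copy the labels.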
Interestingly, printing the DF.PSoft subsets seems to work as expected:
for (i in 1:dim(test)[1]) {
print(DF.PSoft[test[i, 1], c(1, 2, test[i, 2], test[i, 2] + 1)])
}
# Last Salary Dept1 Distrib1
# 2 Mauger 47305.0 215086 100.0
# Last Salary Dept1 Distrib1
# 4 Carruthers 142403.0 215086 100.0
# Last Salary Dept2 Distrib2
# 3 Fester 121991.0 215086 5.0
# Last Salary Dept3 Distrib3
# 5 Vanhoy 47740.0 215086 10.0
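The index-based approach described above can also be sketched without a loop, assuming the factor columns are first converted to character so that labels rather than integer codes are copied. The data frame `DF` below is a cut-down copy of DF.PSoft (three people, two Dept/Distrib pairs), and `hits` and `out` are illustrative names. Note that `DF == "215086"` compares every column, so this relies on the department code never appearing in a non-Dept column.

```r
# Cut-down copy of DF.PSoft with factor columns, as in the question.
DF <- data.frame(
  Last     = c("Schofield", "Mauger", "Fester"),
  Salary   = c("49172.0", "47305.0", "121991.0"),
  Dept1    = c("221230", "215086", "221230"),
  Distrib1 = c("50.0", "100.0", "90.0"),
  Dept2    = c("026112", "", "215086"),
  Distrib2 = c("40.0", "0.0", "5.0"),
  stringsAsFactors = TRUE
)

# Convert every factor column to character so labels, not codes, are copied.
DF[] <- lapply(DF, as.character)

# Row/column positions of every cell equal to the target department.
hits <- which(DF == "215086", arr.ind = TRUE)

# Matrix indexing: the Dept cell itself, and the cell one column to its right.
out <- data.frame(
  Last    = DF$Last[hits[, "row"]],
  Salary  = DF$Salary[hits[, "row"]],
  Dept    = DF[hits],
  Distrib = DF[cbind(hits[, "row"], hits[, "col"] + 1)],
  stringsAsFactors = FALSE
)
out
```

Because the Distrib column is located relative to the matching Dept column, the same code should work for any number of Dept/Distrib pairs, provided each Distrib column sits directly to the right of its Dept column.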
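Thank you very much for any suggestions, and apologies again for the confusing question I started with.
Answer 0 (score: 1)
If I understand your question correctly, you can use reshape.
In its most basic form: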
reshape(DF.PSoft, idvar=c("Last", "Salary"),
varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
# Last Salary time Dept Distrib
# Schofield.49172.1 Schofield 49172 1 221230 50
# Mauger.47305.1 Mauger 47305 1 215086 100
# Fester.121991.1 Fester 121991 1 221230 90
# Carruthers.142403.1 Carruthers 142403 1 215086 100
# Vanhoy.47740.1 Vanhoy 47740 1 221230 75
# Schofield.49172.2 Schofield 49172 2 26112 40
# Mauger.47305.2 Mauger 47305 2 NA 0
# Fester.121991.2 Fester 121991 2 215086 5
# Carruthers.142403.2 Carruthers 142403 2 NA 0
# Vanhoy.47740.2 Vanhoy 47740 2 221704 15
# Schofield.49172.3 Schofield 49172 3 221704 10
# Mauger.47305.3 Mauger 47305 3 NA 0
# Fester.121991.3 Fester 121991 3 221704 5
# Carruthers.142403.3 Carruthers 142403 3 NA 0
# Vanhoy.47740.3 Vanhoy 47740 3 215086 10
You can drop the rownames later if needed.
I did not notice that you only want one department. In that case, try this:
out <- reshape(DF.PSoft, idvar=c("Last", "Salary"),
varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
rownames(out) <- NULL
out[out$Dept == "215086", ]
# Last Salary time Dept Distrib
# 2 Mauger 47305.0 1 215086 100.0
# 4 Carruthers 142403.0 1 215086 100.0
# 8 Fester 121991.0 2 215086 5.0
# 15 Vanhoy 47740.0 3 215086 10.0
Answer 1 (score: 0)
A simple rbindlist from the data.table package. With reproducible data, I can give a reproducible answer :)
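library(data.table)
# Stack the three Last/Salary/Dept/Distrib slices on top of each other.
# The slices have different column names (Dept1, Dept2, ...), so bind by position.
res <- rbindlist(list(DF.PSoft[, c('Last', 'Salary', 'Dept1', 'Distrib1')],
DF.PSoft[, c('Last', 'Salary', 'Dept2', 'Distrib2')],
DF.PSoft[, c('Last', 'Salary', 'Dept3', 'Distrib3')]
), use.names = FALSE)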
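The stacking idea can be generalized to any number of Dept/Distrib pairs and then filtered to one department. This is only a sketch, assuming the data.table package is installed and that the columns are consistently named DeptN/DistribN; `DF` is a cut-down copy of DF.PSoft, and `dept_cols`, `pieces`, and `wanted` are illustrative names.

```r
library(data.table)

# Cut-down copy of DF.PSoft (three people, two Dept/Distrib pairs).
DF <- data.frame(
  Last     = c("Schofield", "Mauger", "Fester"),
  Salary   = c("49172.0", "47305.0", "121991.0"),
  Dept1    = c("221230", "215086", "221230"),
  Distrib1 = c("50.0", "100.0", "90.0"),
  Dept2    = c("026112", "", "215086"),
  Distrib2 = c("40.0", "0.0", "5.0"),
  stringsAsFactors = FALSE
)

# Find every DeptN column, then build one 4-column slice per pair.
dept_cols <- grep("^Dept[0-9]+$", names(DF), value = TRUE)
pieces <- lapply(dept_cols, function(dept) {
  distrib <- sub("^Dept", "Distrib", dept)
  piece <- DF[, c("Last", "Salary", dept, distrib)]
  # Give every slice the same column names so the rows line up.
  names(piece) <- c("Last", "Salary", "Dept", "Distrib")
  piece
})

# Stack the slices and keep only the rows for the target department.
res <- rbindlist(pieces)
wanted <- res[Dept == "215086"]
wanted
```

Renaming each slice before binding avoids relying on positional matching, and the grep over column names means the code does not need to change when the real file has more Dept/Distrib pairs.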