Reshaping a data frame: trouble with a for loop and assigning data to rows of a new frame

Time: 2013-11-18 03:41:14

Tags: r dataframe reshape

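Edit: This is an attempt to communicate my problem better and to provide a reproducible example. Could someone please (1) explain what is wrong with my approach, and (2) suggest any reasonable solution?

My data looks like this: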

DF.PSoft <- structure(list(Last = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("Carruthers", 
"Fester", "Mauger", "Schofield", "Vanhoy"), class = "factor"), 
Salary = structure(c(5L, 3L, 1L, 2L, 4L), .Label = c("121991.0", 
"142403.0", "47305.0", "47740.0", "49172.0"), class = "factor"), 
Dept1 = structure(c(2L, 1L, 2L, 1L, 2L), .Label = c("215086", 
"221230"), class = "factor"), Distrib1 = structure(c(2L, 
1L, 4L, 1L, 3L), .Label = c("100.0", "50.0", "75.0", "90.0"
), class = "factor"), Dept2 = structure(c(2L, 1L, 3L, 1L, 
4L), .Label = c("", "026112", "215086", "221704"), class = "factor"), 
Distrib2 = structure(c(3L, 1L, 4L, 1L, 2L), .Label = c("0.0", 
"15.0", "40.0", "5.0"), class = "factor"), Dept3 = structure(c(3L, 
1L, 3L, 1L, 2L), .Label = c("", "215086", "221704"), class = "factor"), 
Distrib3 = structure(c(2L, 1L, 3L, 1L, 2L), .Label = c("0.0", 
"10.0", "5.0"), class = "factor")), .Names = c("Last", "Salary", 
"Dept1", "Distrib1", "Dept2", "Distrib2", "Dept3", "Distrib3"
), row.names = c(NA, -5L), class = "data.frame")  

>DF.PSoft
          Last   Salary  Dept1 Distrib1  Dept2 Distrib2  Dept3 Distrib3
# 1  Schofield  49172.0 221230     50.0 026112     40.0 221704     10.0
# 2     Mauger  47305.0 215086    100.0             0.0             0.0
# 3     Fester 121991.0 221230     90.0 215086      5.0 221704      5.0
# 4 Carruthers 142403.0 215086    100.0             0.0             0.0
# 5     Vanhoy  47740.0 221230     75.0 221704     15.0 215086     10.0

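The data describe people who work across several departments. Schofield spends 50% of his time in department 221230, 40% in 026112, and 10% in 221704. The real data set has 10 Dept and Distrib columns in total; in this example I am using 3 Dept/Distrib columns.

I want to reshape the data into a new frame that shows Last, Salary, the "Dept" column that contains 215086 (if any "Dept" column matches), and the corresponding "Distrib" column: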

>DF.Desired
          Last   Salary    Dept  Distrib  
# 1     Mauger  47305.0  215086    100.0
# 2     Fester 121991.0  215086      5.0 
# 3 Carruthers 142403.0  215086    100.0
# 4     Vanhoy  47740.0  215086     10.0

How can I do this? I have been struggling with it for a while. Here is what I have so far.

  1. Show all indices where the data == "215086".

    test <- which(DF.PSoft=="215086", arr.ind=TRUE)
    
    >test
           row col
    # [1,]   2   3
    # [2,]   4   3
    # [3,]   3   5
    # [4,]   5   7
    
  2. Create an empty data frame to hold the data I extract in the next step.

    DF.blank <- data.frame(Last=character(dim(test)[1]), Salary=character(dim(test)[1]), Dept=character(dim(test)[1]), Distrib=character(dim(test)[1]),  stringsAsFactors=FALSE)
    
    >DF.blank
         Last Salary Dept Distrib
    # 1                         
    # 2                         
    # 3                          
    # 4
    
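  3. Fill the empty data frame with the data I want. Using the indices from 'test', subset the full data set (DF.PSoft) for every entry == "215086": take the DF.PSoft row number listed in 'test', then take DF.PSoft columns 1 and 2, whichever column contains "215086" (this pulls "215086" from the matching "Dept" column of DF.PSoft), and the column immediately to the right of that match (this pulls the appropriate Distrib). If I am thinking about this correctly, the method will work no matter how many "Dept" or "Distrib" columns the file has. I would like to keep that ability.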

    for(i in 1:dim(test)[1]){
      DF.blank[i,] <- DF.PSoft[test[i,1], c(1,2, test[i,2], test[i,2]+1)]
      }
    
  4. Infuriatingly, I get this result:

    >DF.blank
        Last Salary Dept Distrib
    # 1    3      3    1       1
    # 2    1      2    1       1
    # 3    2      1    3       4
    # 4    5      4    2       2
    
  5. Interestingly, printing the DF.PSoft subsets seems to work as expected:

    for (i in 1:dim(test)[1]) {
        print(DF.PSoft[test[i, 1], c(1, 2, test[i, 2], test[i, 2] + 1)])
    }
    
    #     Last  Salary  Dept1 Distrib1
    # 2 Mauger 47305.0 215086    100.0
    #         Last   Salary  Dept1 Distrib1
    # 4 Carruthers 142403.0 215086    100.0
    #     Last   Salary  Dept2 Distrib2
    # 3 Fester 121991.0 215086      5.0
    #     Last  Salary  Dept3 Distrib3
    # 5 Vanhoy 47740.0 215086     10.0
    
  6. Any advice is much appreciated, and apologies again for the muddled question I started with.
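
On the integer output in step 4: the `dput` above stores every column of DF.PSoft as a factor, and assigning a factor into a character column writes the factor's internal integer codes rather than its labels. The block below is a minimal sketch, not the asker's code, of the same loop after first converting the factors to character. The names `DF`, `hits`, and `out` are introduced here for illustration, and the data is retyped from the printed table.

```r
# Data retyped from the printed table; stringsAsFactors = TRUE mimics the dput.
DF <- data.frame(
  Last     = c("Schofield", "Mauger", "Fester", "Carruthers", "Vanhoy"),
  Salary   = c("49172.0", "47305.0", "121991.0", "142403.0", "47740.0"),
  Dept1    = c("221230", "215086", "221230", "215086", "221230"),
  Distrib1 = c("50.0", "100.0", "90.0", "100.0", "75.0"),
  Dept2    = c("026112", "", "215086", "", "221704"),
  Distrib2 = c("40.0", "0.0", "5.0", "0.0", "15.0"),
  Dept3    = c("221704", "", "221704", "", "215086"),
  Distrib3 = c("10.0", "0.0", "5.0", "0.0", "10.0"),
  stringsAsFactors = TRUE
)

# Replace every factor column with its labels as plain character strings.
DF[] <- lapply(DF, as.character)

# Row/column positions of every cell equal to "215086".
hits <- which(DF == "215086", arr.ind = TRUE)
n <- nrow(hits)

# Empty character frame, one row per hit (same idea as DF.blank).
out <- data.frame(Last = character(n), Salary = character(n),
                  Dept = character(n), Distrib = character(n),
                  stringsAsFactors = FALSE)

for (i in seq_len(n)) {
  r <- hits[i, "row"]
  k <- hits[i, "col"]
  # Columns 1 and 2, the matching Dept column, and the Distrib column beside it.
  out[i, ] <- DF[r, c(1, 2, k, k + 1)]
}
print(out)
```

With character columns the loop copies the labels themselves, so the printed subsets in step 5 and the filled frame agree.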

2 Answers:

Answer 0: (score: 1)

If I understand your question correctly, you can use reshape in its most basic form:

reshape(DF.PSoft, idvar=c("Last", "Salary"), 
        varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
#                           Last Salary time   Dept Distrib
# Schofield.49172.1    Schofield  49172    1 221230      50
# Mauger.47305.1          Mauger  47305    1 215086     100
# Fester.121991.1         Fester 121991    1 221230      90
# Carruthers.142403.1 Carruthers 142403    1 215086     100
# Vanhoy.47740.1          Vanhoy  47740    1 221230      75
# Schofield.49172.2    Schofield  49172    2  26112      40
# Mauger.47305.2          Mauger  47305    2     NA       0
# Fester.121991.2         Fester 121991    2 215086       5
# Carruthers.142403.2 Carruthers 142403    2     NA       0
# Vanhoy.47740.2          Vanhoy  47740    2 221704      15
# Schofield.49172.3    Schofield  49172    3 221704      10
# Mauger.47305.3          Mauger  47305    3     NA       0
# Fester.121991.3         Fester 121991    3 221704       5
# Carruthers.142403.3 Carruthers 142403    3     NA       0
# Vanhoy.47740.3          Vanhoy  47740    3 215086      10

You can drop the rownames afterwards if you want.

I hadn't noticed that you only want one department. In that case, try this:

out <- reshape(DF.PSoft, idvar=c("Last", "Salary"),
               varying = 3:ncol(DF.PSoft), sep = "", direction = "long")
rownames(out) <- NULL
out[out$Dept == "215086", ]
#          Last   Salary time   Dept Distrib
# 2      Mauger  47305.0    1 215086   100.0
# 4  Carruthers 142403.0    1 215086   100.0
# 8      Fester 121991.0    2 215086     5.0
# 15     Vanhoy  47740.0    3 215086    10.0
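
As a possible follow-up to the reshape approach, here is a sketch that picks the Dept/Distrib columns by name pattern (so it should carry over to 10 pairs) and trims the result to the four columns of DF.Desired. The names `DF`, `long`, and `wanted` are illustrative, the data is retyped from the question as character columns, and the regular expression assumes the columns are always named `Dept<n>` and `Distrib<n>`.

```r
# Data retyped from the question, kept as character so "026112" keeps its zero.
DF <- data.frame(
  Last     = c("Schofield", "Mauger", "Fester", "Carruthers", "Vanhoy"),
  Salary   = c("49172.0", "47305.0", "121991.0", "142403.0", "47740.0"),
  Dept1    = c("221230", "215086", "221230", "215086", "221230"),
  Distrib1 = c("50.0", "100.0", "90.0", "100.0", "75.0"),
  Dept2    = c("026112", "", "215086", "", "221704"),
  Distrib2 = c("40.0", "0.0", "5.0", "0.0", "15.0"),
  Dept3    = c("221704", "", "221704", "", "215086"),
  Distrib3 = c("10.0", "0.0", "5.0", "0.0", "10.0"),
  stringsAsFactors = FALSE
)

# Every Dept<n>/Distrib<n> column, however many pairs the file has.
pair_cols <- grep("^(Dept|Distrib)[0-9]+$", names(DF))

long <- reshape(DF, idvar = c("Last", "Salary"),
                varying = pair_cols, sep = "", direction = "long")

# which() skips NA comparisons, in case blank departments were read as NA.
keep <- which(long$Dept == "215086")
wanted <- long[keep, c("Last", "Salary", "Dept", "Distrib")]
rownames(wanted) <- NULL

# Distrib is text here; convert it if arithmetic is needed.
wanted$Distrib <- as.numeric(wanted$Distrib)
print(wanted)
```

Dropping the `time` column is optional; it records which Dept/Distrib pair each row came from.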

Answer 1: (score: 0)

I think a simple rbindlist from the data.table package would do. With reproducible data, I can give a reproducible answer :)

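library(data.table)  # provides rbindlist()

# Stack the three Dept/Distrib pairs on top of each other
res <- rbindlist(list(DF.PSoft[, c('Last', 'Salary', 'Dept1', 'Distrib1')],
    DF.PSoft[, c('Last', 'Salary', 'Dept2', 'Distrib2')],
    DF.PSoft[, c('Last', 'Salary', 'Dept3', 'Distrib3')]
))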
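
The snippet above stops after stacking the pairs. One way it might be finished, sketched here under the assumption that the columns are named `Dept1..DeptN` and `Distrib1..DistribN`, is to build the pieces in a loop, give them common names, and then filter for the department. The names `DF`, `pieces`, and `hit` are illustrative and the data is retyped from the question.

```r
library(data.table)

# Data retyped from the question as character columns.
DF <- data.frame(
  Last     = c("Schofield", "Mauger", "Fester", "Carruthers", "Vanhoy"),
  Salary   = c("49172.0", "47305.0", "121991.0", "142403.0", "47740.0"),
  Dept1    = c("221230", "215086", "221230", "215086", "221230"),
  Distrib1 = c("50.0", "100.0", "90.0", "100.0", "75.0"),
  Dept2    = c("026112", "", "215086", "", "221704"),
  Distrib2 = c("40.0", "0.0", "5.0", "0.0", "15.0"),
  Dept3    = c("221704", "", "221704", "", "215086"),
  Distrib3 = c("10.0", "0.0", "5.0", "0.0", "10.0"),
  stringsAsFactors = FALSE
)

# Count the Dept<n> columns so the code adapts to any number of pairs.
n_pairs <- sum(grepl("^Dept[0-9]+$", names(DF)))

# One four-column piece per pair, renamed so every piece has the same names.
pieces <- lapply(seq_len(n_pairs), function(k) {
  piece <- DF[, c("Last", "Salary", paste0("Dept", k), paste0("Distrib", k))]
  setNames(piece, c("Last", "Salary", "Dept", "Distrib"))
})

res <- rbindlist(pieces)

# Keep only the rows for the department of interest.
hit <- res[Dept == "215086"]
print(hit)
```

Renaming each piece before binding keeps the result independent of whether `rbindlist` matches columns by name or by position.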