Creating variables from elements in different data frame rows

Time: 2013-10-08 19:25:40

Tags: r

I have been struggling to create some variables in my data frame, which looks like this:

df.1 <- data.frame(unit = c('A','B','C','A','B','C','D'),location = c(1,1,1,2,2,2,2), value.X = c('5','6', '4', '3','10', '7','3'),value.Y = c('1','4','7','9','4','6','4'),team = c('A / B', 'A / B', 'C' , 'A', 'B / C', 'B / C','D'),team.B = c('A / C ', 'A / C', 'B', 'A / B / D', 'A / B / D', 'C', 'A / B / D'),supra = c('A', 'B', 'C', 'A / C / D', 'B', 'A / C / D' , 'A / C / D'),pos.supra = c(1,2,3,1,2,1,1))

  unit location value.X value.Y  team    team.B     supra pos.supra
1    A        1       5       1 A / B    A / C          A         1
2    B        1       6       4 A / B     A / C         B         2
3    C        1       4       7     C         B         C         3
4    A        2       3       9     A A / B / D A / C / D         1
5    B        2      10       4 B / C A / B / D         B         2
6    C        2       7       6 B / C         C A / C / D         1
7    D        2       3       4     D A / B / D A / C / D         1

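I need to create a variable that sums the differences between value.X and value.Y for the units that are in team.B, are not in team, and are in the relevant supra group. The relevant supra group is the one ranked first, or the one ranked immediately below it if the pos.supra of the unit in question equals 1. I need to do this for every unit within each location. I know that is a lot of steps, so here is a more detailed description. Maybe you can skip steps or reverse their order; that is fine.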

(1) Find the supra group that is ranked first, or the one ranked immediately below it if the unit has pos.supra equal to 1:

supra.I.need = c('B','A','A','B','A / C / D', 'B','B')

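(2) Check which units in supra.I.need are in team.B but not in team:

that.is.not.in.team.but.are.in.team.B = c('NA','NA','NA','B', 'A,D','NA','B')

(3) Finally, calculate the difference between value.X and value.Y for all the units in the variable above and sum them (note that for A and D I add the two sums together):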

delta = c('NA','NA','NA','8','2','NA','8')

So the final data frame should look like this:

df.2 <- data.frame(unit = c('A','B','C','A','B','C','D'),location = c(1,1,1,2,2,2,2), value.X = c('5','6', '4', '3','10', '7','3'),value.Y = c('1','4','7','9','4','6','4'),team = c('A / B', 'A / B', 'C' , 'A', 'B / C', 'B / C','D'),team.B = c('A / C ', 'A / C', 'B', 'A / B / D', 'A / B / D', 'C', 'A / B / D'),supra = c('A', 'B', 'C', 'A / C / D', 'B', 'A / C / D' , 'A / C / D'),pos.supra = c(1,2,3,1,2,1,1),supra.I.need = c('B','A','A','B','A / C / D', 'B','B'),that.is.not.in.team.but.are.in.team.B = c('NA','NA','NA','B', 'A,D','NA','B'),delta = c('NA','NA','NA','8','2','NA','8'))

  unit location value.X value.Y  team    team.B     supra pos.supra supra.I.need that.is.not.in.team.but.are.in.team.B delta
1    A        1       5       1 A / B    A / C          A         1            B                                    NA    NA
2    B        1       6       4 A / B     A / C         B         2            A                                    NA    NA
3    C        1       4       7     C         B         C         3            A                                    NA    NA
4    A        2       3       9     A A / B / D A / C / D         1            B                                     B     8
5    B        2      10       4 B / C A / B / D         B         2    A / C / D                                   A,D     2
6    C        2       7       6 B / C         C A / C / D         1            B                                    NA    NA
7    D        2       3       4     D A / B / D A / C / D         1            B                                     B     8

Any help is much appreciated.

1 answer:

Answer 0: (score: 2)

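Here is one go at it. Most of this is creating variables, or matching multiple results and subsetting with %in%. I got stuck on the last step, so a loop was the easy way out. I have lightly commented the code to show what I am doing.

Note that all of this works on character vectors, which means using stringsAsFactors = FALSE in data.frame. I am not sure why your numeric vectors were all entered as character vectors, but if that is not the case in your actual data set, you can drop the calls to as.numeric.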
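To illustrate the note above, here is a minimal sketch on a cut-down, hypothetical data frame called `toy` (the name and the values are assumptions for illustration, not the asker's full data). It shows the text columns staying as character vectors and the numbers stored as text being converted once, up front.

```r
# Hypothetical toy data: a cut-down version of the columns in the question.
toy <- data.frame(unit = c("A", "B"),
                  value.X = c("5", "6"),
                  team = c("A / B", "C"),
                  stringsAsFactors = FALSE)

# With stringsAsFactors = FALSE the text columns are character vectors,
# which is what strsplit() requires.
is.character(toy$team)

# Convert the numbers stored as text once, instead of calling
# as.numeric() every time the column is used.
toy$value.X <- as.numeric(toy$value.X)
is.numeric(toy$value.X)
```

If the value columns are converted this way first, the as.numeric calls in the code below become unnecessary but harmless.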

require(plyr)
# create the supra needed when pos.supra is 1 or not
df1 = ddply(df.1, .(location), transform,
        needed = ifelse(pos.supra == 1, supra[pos.supra == 2], supra[pos.supra == 1]) )

# break apart the teams into lists for team, team.B, needed
    # the result is a list
# strsplit needs character vectors, not factors
team = strsplit(df1$team, " / ")
teamb = strsplit(df1$team.B, " / ")
needs = strsplit(as.character(df1$needed), " / ")

# pull out everything in team b that's not in team
b.not.team = mapply(function(x, y) x[!x %in% y], teamb, team)

# now match needed supra and everything in team b but not team and
    # paste together the results with a comma between and put in df1
df1$bneeded = mapply(function(x, y) paste0(x[x %in% y], collapse = ","), needs, b.not.team)


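# pre-allocate the result column, then fill it row by row
df1$delta = NA_real_
for (i in seq_len(nrow(df1))) {
    # units whose differences should be summed for this row
    matchto = unlist(strsplit(df1$bneeded[i], ","))
    diffs = as.numeric(df1$value.X[df1$unit %in% matchto]) -
        as.numeric(df1$value.Y[df1$unit %in% matchto])
    df1$delta[i] = sum(diffs)
}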

df1$bneeded[df1$bneeded == ""] = NA
df1$delta[df1$delta == 0] = NA
df1
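The splitting and matching steps above may be easier to follow on a single row. The sketch below walks through them by hand, using the team, team.B and needed supra values of row 5 from the question; the standalone variable names are only for illustration.

```r
# Values taken from row 5 of the question's data frame.
team   <- strsplit("B / C", " / ")[[1]]       # "B" "C"
team.b <- strsplit("A / B / D", " / ")[[1]]   # "A" "B" "D"
needed <- strsplit("A / C / D", " / ")[[1]]   # "A" "C" "D"

# Everything in team.B that is not in team.
b.not.team <- team.b[!team.b %in% team]       # "A" "D"

# Keep only the needed supra members that are also in b.not.team,
# then collapse them into one comma-separated string.
bneeded <- paste0(needed[needed %in% b.not.team], collapse = ",")
bneeded                                       # "A,D"
```

One thing to watch for: the first team.B entry in the question is 'A / C ' with a trailing space, so splitting it yields "C " rather than "C", and that element will not match a plain "C". Passing the pieces through trimws() after strsplit is one way to guard against this, assuming the trailing space is unintentional.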

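**Edit: loop alternative** Here is an alternative to the loop for computing the differences between x and y. Sometimes all you need is a fresh morning to realize what was going wrong in your code. ;) I like loops in many situations because they make it easy to read what the code is doing. In this case I used mapply in the rest of the code, so here is the mapply option.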

df1$diffxy = mapply(function(x, y) sum(as.numeric(df1$value.X[x %in% y])) - 
                    sum(as.numeric(df1$value.Y[x %in% y])),
      df1["unit"], strsplit(df1$bneeded, ","))
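For comparison, here is a hedged sketch of the same per-row sum written with vapply. This is not the code from the answer: it runs on a hypothetical `toy` data frame whose value columns are assumed to be numeric already. It returns NA directly when nothing matches, so a genuine sum of zero is not confused with "no units matched".

```r
# Hypothetical toy data; value columns are assumed to be numeric already.
toy <- data.frame(unit = c("A", "B", "C"),
                  value.X = c(5, 6, 4),
                  value.Y = c(1, 4, 7),
                  bneeded = c("B,C", NA, "A"),
                  stringsAsFactors = FALSE)

# For each row, sum value.X - value.Y over the units named in bneeded.
toy$delta <- vapply(strsplit(toy$bneeded, ","), function(wanted) {
  hit <- toy$unit %in% wanted
  if (!any(hit)) return(NA_real_)   # nothing matched: NA rather than 0
  sum(toy$value.X[hit] - toy$value.Y[hit])
}, numeric(1))

toy$delta   # -1 NA 4
```

Like the loop and the mapply version, this matches units across the whole data frame. Restricting the match to the same location would need an extra condition on `hit`.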