Question

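I'm trying to simplify my R code for generating a new column in a data frame. My data is organized as follows (columns 1-4), and I want to generate column 5 (unlabeled):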

col1  col2  col3  col4
 t1    f1    A    20     0
 t1    f2    A    19     0
 t1    f3    A    21     0
 t1    f1    B    25     5
 t1    f2    B    25     6
 t1    f3    B    26     5
 t2    f1    A    18     0
 t2    f2    A    19     0
 t2    f3    A    18     0
 t2    f1    B    20     2
 t2    f2    B    20     1
 t2    f3    B    20     2

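Edit: Column 5 is computed like this (shown as equations). It takes the col4 value for t1, f1 and subtracts the col4 value for t1, f1 where col3 = "A". So in row 1, it takes 20 and subtracts that same 20. In row 4, it takes 25 and subtracts the 20 found in row 1, because both rows refer to sample f1 from treatment t1, but I am measuring the value of two different things (A and B). Column 5 as shown above is therefore calculated as follows: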

col5
(20-20)
(19-19)
(21-21)
(25-20)
(25-19)
(26-21)
etc...

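Adding a column is simple enough, but I can't find a good way to build in all the conditions. If anyone has suggestions on how to write this code, and/or how to better organize my data to make things easier, I would be very grateful! So far I have just been generating the column 5 values by hand in MS Excel :\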

Cheers

Edit2: Answered. Many thanks to everyone who replied!

Answer 1

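So, if I understand correctly: when col3 == "B", you look up the matching row where col3 == "A" and subtract its col4 value? Then you need something like this (assuming your data frame is called df):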

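df$col5 <- 0  # "A" rows subtract their own value, so they stay 0
for (i in seq_len(nrow(df))) {
  if (df[i, 3] == "B") {
    # locate the "A" row with the same col1 and col2
    a_row <- which(df[, 1] == df[i, 1] & df[, 2] == df[i, 2] & df[, 3] == "A")
    df[i, 5] <- df[i, 4] - df[a_row, 4]
  }
}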

Fixed a typo in the original post.
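
The same lookup can also be written without an explicit loop. The following is only a sketch, assuming the column names from the question; `ref`, `key` and `ref_key` are names introduced here for illustration, and the data frame is rebuilt from the sample table so the block runs on its own.

```r
# Sample data from the question
df <- data.frame(
  col1 = rep(c("t1", "t2"), each = 6),
  col2 = rep(c("f1", "f2", "f3"), times = 4),
  col3 = rep(rep(c("A", "B"), each = 3), times = 2),
  col4 = c(20, 19, 21, 25, 25, 26, 18, 19, 18, 20, 20, 20)
)

# Reference rows: the "A" measurement for every treatment/sample pair
ref <- df[df$col3 == "A", ]

# Build a treatment/sample key for every row and for the reference rows
key     <- paste(df$col1, df$col2, sep = "|")
ref_key <- paste(ref$col1, ref$col2, sep = "|")

# match() gives, for each row, the position of its "A" reference row
df$col5 <- df$col4 - ref$col4[match(key, ref_key)]
```

Because the lookup is keyed on col1 and col2 rather than on row position, it should not depend on how the rows are sorted.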

Answer 2

df = df[order(df$col1,df$col3,df$col2),]          ## make sure you have it ordered right
flength = length(unique(df$col2))            ## get the length of unique col2
alength = length(unique(df$col3))            ## get the length of unique col3
Avector = df[df$col3=="A","col4"]             ## get the elements of col 4 with col3="A"
sapplyVec = (1:alength) - 1                  ## create vector to sapply over

## take the elements in Avector in sections of size flength and repeat those
## section alength times.
Avector = c(sapply(sapplyVec ,function(x) rep(Avector[c(1:flength)+(x*flength)],alength)))

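This creates a vector from col4 where col3 = "A". It then takes sections of size flength (3 in your case) and repeats each one alength times (2 in your case). From here you can add the new column as col4 - Avector:

df$col5 = df$col4 - Avector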
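
Note that the sapply call above runs alength times, which happens to equal the number of col1 groups in the sample data. The following is a sketch of a variant that iterates over the number of col1 groups instead. It assumes a balanced design (every col1/col3 combination has the same flength samples); `tlength` and `expanded` are names introduced here, not part of the original answer.

```r
# Sample data from the question
df <- data.frame(
  col1 = rep(c("t1", "t2"), each = 6),
  col2 = rep(c("f1", "f2", "f3"), times = 4),
  col3 = rep(rep(c("A", "B"), each = 3), times = 2),
  col4 = c(20, 19, 21, 25, 25, 26, 18, 19, 18, 20, 20, 20)
)
df <- df[order(df$col1, df$col3, df$col2), ]

flength <- length(unique(df$col2))  # samples per group
alength <- length(unique(df$col3))  # measured things (A, B)
tlength <- length(unique(df$col1))  # treatments; name introduced here

Avector <- df[df$col3 == "A", "col4"]

# One block of flength reference values per treatment,
# each block repeated alength times
expanded <- unlist(lapply(seq_len(tlength) - 1, function(x) {
  rep(Avector[seq_len(flength) + x * flength], alength)
}))

# Guard against silent recycling if the design is not balanced
stopifnot(length(expanded) == nrow(df))
df$col5 <- df$col4 - expanded
```

The length check is there because R would otherwise recycle a shorter vector during subtraction, which can hide a mismatch between the expanded vector and the data frame.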

Answer 3

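Although user2864849's approach works for this example data frame, when I tried to apply it to my actual data it ended up producing twice as many values for column 5 as it should. I couldn't figure out why, but it has to do with how it handles the sapply function. Revisiting the problem, I realized there is a very simple, albeit longer, solution that works. It borrows from that user's code the idea of generating new vectors from sorted data.

I generated a vector of the column 4 values for each subset in column 3. Then I sorted the data frame so that its rows appear in the same order as the vectors I generated. Next I created a new vector combining these individual vectors to produce column 5. Finally, I added column 5 to the sorted data frame.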

# Define variables - optional
col1 <- as.factor(df$col1)
col2 <- as.factor(df$col2)
col3 <- as.factor(df$col3)
col4 <- df$col4

# Create vectors of Cq values for each gene
# (assumes df rows are already ordered by col1, then col2, within each col3 value)
col3Avec <- df[col3 == "A", "col4"]
col3Bvec <- df[col3 == "B", "col4"]

# Create vectors of dCq values for each gene
col5A <- col3Avec - col3Avec
col5B <- col3Bvec - col3Avec

# Sort data frame so its order matches the order of the dCq vectors
dfsort <- df[order(col3, col1, col2), ]

# Add dCq vectors in the correct order as a new column to the sorted data frame
dfsort$col5 <- c(col5A, col5B)

# Total = 6 lines of code, not including variable definitions

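I think this approach will work regardless of unequal lengths or sample sizes. It looks like a lot of code, but as long as all variables are named consistently in the data you apply it to, only minimal recoding is needed.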
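
If some samples lack an "A" measurement, subtracting two vectors by position would no longer line up row for row, so a keyed join may be a safer variant. The following is a minimal sketch using base R merge(), assuming the column names from the question. `ref`, `col4_A` and `out` are names introduced here, and one reference row is dropped purely to mimic unequal sample sizes.

```r
# Sample data from the question
df <- data.frame(
  col1 = rep(c("t1", "t2"), each = 6),
  col2 = rep(c("f1", "f2", "f3"), times = 4),
  col3 = rep(rep(c("A", "B"), each = 3), times = 2),
  col4 = c(20, 19, 21, 25, 25, 26, 18, 19, 18, 20, 20, 20)
)

# Hypothetical gap: remove the "A" measurement for t2, f3
df <- df[!(df$col1 == "t2" & df$col2 == "f3" & df$col3 == "A"), ]

# Lookup table of reference ("A") values, one row per treatment/sample pair
ref <- df[df$col3 == "A", c("col1", "col2", "col4")]
names(ref)[3] <- "col4_A"

# Left join: every row keeps its place, unmatched rows get NA for col4_A
out <- merge(df, ref, by = c("col1", "col2"), all.x = TRUE)
out$col5 <- out$col4 - out$col4_A

# Restore a readable order: treatment, measured thing, sample
out <- out[order(out$col1, out$col3, out$col2), ]
```

Rows without a reference value end up with NA in col5 instead of being silently paired with the wrong "A" row.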

Adding a column using an equation in R
