Based on the first two rows

Time: 2017-06-13 20:33:10

Tags: r

My large dataset looks like this (it actually has thousands of columns):

A = c("AA","AA","AA","AA","AA")
B = c("CC","GG","CC","CG","GG")
C = c("TT","AA","AA","AT","TT")
D = c("GG","GG","GG","GG","GG")
E = c("TT","TT","NA","TT","TT")

mydata = data.frame(A, B, C, D, E)
mydata    

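Basically, I want to do two things:

  1. Remove the columns whose values in the first and second rows are identical. In this example, columns "A", "D" and "E" would be dropped.

  2. Recode the remaining cells relative to the values in the first and second rows of the same column: a cell that matches the value in row 1 becomes "f", a cell that matches the value in row 2 becomes "m", and anything else becomes "h".

This is the table I would like to end up with: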

    B = c("CC","GG","f","h","m")
    C = c("TT","AA","m","h","f")
    
    mydata = data.frame(B, C)
    mydata    
    

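For the first point, I managed to get a similar result with the apply function from "How to remove non-informative columns with and without missing values in dataframe", but what I really want is to apply a condition to individual cells, the way the "if" function works in Excel.

I would appreciate any ideas about what type of function to use.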

1 answer:

Answer 0: (score: 1)

First, make sure your strings are stored as characters and not as factors:

A = c("AA","AA","AA","AA","AA")
B = c("CC","GG","CC","CG","GG")
C = c("TT","AA","AA","AT","TT")
D = c("GG","GG","GG","GG","GG")
E = c("TT","TT","NA","TT","TT")

# Use FALSE rather than F: F is an ordinary variable that can be reassigned
mydata = data.frame(A, B, C, D, E, stringsAsFactors = FALSE)

Then, for the first step, you can do the following:

# Keep only the columns whose first and second rows differ
mydata2 <- mydata[, !mydata[1, ] == mydata[2, ]]
mydata2

And for the second step:

# Recode every row after the first two, column by column
mydata2[-c(1, 2), ] <- lapply(mydata2, function(x) {
  ifelse(x[-c(1, 2)] == x[1], "f",
         ifelse(x[-c(1, 2)] == x[2], "m", "h"))
})

> mydata2
   B  C
1 CC TT
2 GG AA
3  f  m
4  h  h
5  m  f
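
If you need to do this repeatedly, the two steps can be wrapped in one function. The following is only a sketch and not part of the original answer: the name `recode_by_first_two` is made up for illustration, and it assumes that a column with a real missing value in row 1 or row 2 should be dropped (the example data only contains the string "NA", which is compared like any other string). It also uses `drop = FALSE` so that the result stays a data frame even when only one column survives.

```r
# Hypothetical helper: drop columns whose first two rows match,
# then recode the remaining rows as "f" / "m" / "h".
recode_by_first_two <- function(df) {
  # Compare plain character vectors, never factor codes
  df[] <- lapply(df, as.character)

  # Values of row 1 and row 2 as ordinary vectors
  r1 <- unlist(df[1, ], use.names = FALSE)
  r2 <- unlist(df[2, ], use.names = FALSE)

  # Assumption: a real NA in either row means the column is not kept
  keep <- !is.na(r1) & !is.na(r2) & r1 != r2
  out <- df[, keep, drop = FALSE]

  # Recode rows 3..n relative to rows 1 and 2 of the same column
  if (nrow(out) > 2 && ncol(out) > 0) {
    out[-c(1, 2), ] <- lapply(out, function(x) {
      rest <- x[-c(1, 2)]
      ifelse(rest == x[1], "f", ifelse(rest == x[2], "m", "h"))
    })
  }
  out
}

# Usage with the data from the question
mydata <- data.frame(
  A = c("AA", "AA", "AA", "AA", "AA"),
  B = c("CC", "GG", "CC", "CG", "GG"),
  C = c("TT", "AA", "AA", "AT", "TT"),
  D = c("GG", "GG", "GG", "GG", "GG"),
  E = c("TT", "TT", "NA", "TT", "TT"),
  stringsAsFactors = FALSE
)
res <- recode_by_first_two(mydata)
res
```

On the example data this should give the same B and C columns as the output shown above.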