How to remove duplicate columns with identical values from a data frame

Time: 2014-02-15 16:19:36

Tags: r sas

I have a data frame like this:

1    1    1    a    1    a    
2    1    2    b    1    b    
3    8    3    b    1    b    
4    8    4    k    1    k    
1    1    1    t    1    t    
2    1    2    t    1    t 

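I want to remove duplicate columns that hold identical values. Column 3 is a copy of column 1, so I want to drop either column 3 or column 1; column 6 is a copy of column 4, so I want to drop either column 6 or column 4. My data is very large: it has 800 columns, with names like a1, a2, a3, ..., a800.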

So my result would look like this:

1    1    a    1       
2    1    b    1        
3    8    b    1       
4    8    k    1    
1    1    t    1   
2    1    t    1 

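It would be great if someone could help me with this task.

Thank you for your replies. I will try this code; it would be great to get equivalent solutions in both SAS and R.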

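3 answers:

Answer 0 (score: 3)

Perhaps one of these will work for you:

Create a logical vector that identifies the duplicate columns.

My guess is that the paste approach is faster, but calling duplicated on the transposed data.frame is probably more reliable.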

Dups1 <- duplicated(lapply(mydf, paste, collapse = ""))
Dups2 <- duplicated(t(mydf))

Dups1
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Dups2
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Use either of these logical vectors to select the columns you want.

mydf[!Dups1]
#   V1 V2 V4 V5
# 1  1  1  a  1
# 2  2  1  b  1
# 3  3  8  b  1
# 4  4  8  k  1
# 5  1  1  t  1
# 6  2  1  t  1
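
To make the two approaches above reproducible, here is a minimal sketch that rebuilds the example data from the question and compares both vectors. The column names V1 to V6 follow the printed output; the second data frame, `tricky`, is a hypothetical example added only to illustrate why pasting with an empty separator can be less reliable.

```r
# Example data from the question (column names assumed to be V1..V6).
mydf <- data.frame(
  V1 = c(1, 2, 3, 4, 1, 2),
  V2 = c(1, 1, 8, 8, 1, 1),
  V3 = c(1, 2, 3, 4, 1, 2),
  V4 = c("a", "b", "b", "k", "t", "t"),
  V5 = c(1, 1, 1, 1, 1, 1),
  V6 = c("a", "b", "b", "k", "t", "t"),
  stringsAsFactors = FALSE
)

# Approach 1: collapse each column into one string, then look for repeats.
Dups1 <- duplicated(lapply(mydf, paste, collapse = ""))
# Approach 2: transpose so that columns become rows, then look for repeated rows.
Dups2 <- as.vector(duplicated(t(mydf)))

result <- mydf[!Dups1]

# Hypothetical edge case: two different columns that paste to the same string "111".
tricky <- data.frame(a = c(1, 11), b = c(11, 1))
collide_empty <- duplicated(lapply(tricky, paste, collapse = ""))
# A separator that cannot occur in the data avoids this particular collision.
collide_sep <- duplicated(lapply(tricky, paste, collapse = "\r"))
```

With an empty separator, `collide_empty` flags column `b` as a duplicate even though it differs from `a`; with a separator, it does not. If you use the paste approach on real data, choosing a separator that cannot appear in the values seems the safer option.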

Answer 1 (score: 3)

You can use

dat[!duplicated(unclass(dat))]

#   V1 V2 V4 V5
# 1  1  1  a  1
# 2  2  1  b  1
# 3  3  8  b  1
# 4  4  8  k  1
# 5  1  1  t  1
# 6  2  1  t  1

where dat is the name of your data frame.
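
Why `unclass`? Calling `duplicated` directly on a data frame compares rows. Stripping the class leaves a plain list of columns, so `duplicated` compares the columns instead. The sketch below wraps the one-liner in a small helper; the name `drop_duplicate_columns` and the a1 to a6 column names are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper: keep the first occurrence of each distinct column.
drop_duplicate_columns <- function(df) {
  stopifnot(is.data.frame(df))
  keep <- !duplicated(unclass(df))   # unclass() turns the data frame into a list of columns
  df[, keep, drop = FALSE]           # drop = FALSE keeps a data frame even if one column remains
}

# Example data from the question, using the a1, a2, ... naming it mentions.
dat <- data.frame(
  a1 = c(1, 2, 3, 4, 1, 2),
  a2 = c(1, 1, 8, 8, 1, 1),
  a3 = c(1, 2, 3, 4, 1, 2),
  a4 = c("a", "b", "b", "k", "t", "t"),
  a5 = c(1, 1, 1, 1, 1, 1),
  a6 = c("a", "b", "b", "k", "t", "t"),
  stringsAsFactors = FALSE
)

deduped <- drop_duplicate_columns(dat)
dropped <- names(dat)[duplicated(unclass(dat))]  # names of the columns that were removed
```

Here `names(deduped)` is a1, a2, a4, a5 and `dropped` is a3, a6. Because the first copy is always the one kept, the surviving column names depend on the column order.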

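Answer 2 (score: 0)

The SAS code below should give you the desired answer. The algorithm is simple: for each row of the original data, eliminate the column pairs that are not duplicates. The main techniques I used are:

(1) using the set options point= and nobs= to perform random access on a data set

(2) handling character and numeric variables separately because of the constraints of SAS arrays

(3) using call symputx() in a data _NULL_ step to store data set attributes in macro variables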

/* original dataset */
*only x1,x2,x4,x5 are non-duplicate;
data a; 
    input x1 x2 x3 x4 $ x5 x6 $ x7 x8 $;
    cards;
1    1    1    a    1    a    1    a
2    1    2    b    1    b    2    b
3    8    3    b    1    b    3    b
4    8    4    k    1    k    4    k
1    1    1    t    1    t    1    t
2    1    2    t    1    t    2    t
;run;

/* get the number of columns (character/numeric, respectively) */
data _NULL_;
    set a(firstobs=1 obs=1);
    /* # of numeric vars */
    array arrN{*} _Numeric_;
    x=dim(arrN);
    call symputx("NN",x,"G");

    /* # of char vars */
    array arrC{*} _Character_;
    y=dim(arrC);
    call symputx("NC",y,"G");

    /* #obs of the input dataset */
    dsid=open("a");
    z=attrn(dsid,"NOBS");
    call symputx("N",z,"G");
    w=close(dsid);

run;

/* check: placed after RUN so the macro variables exist when %put resolves them */
%put NN=&NN, NC=&NC, N=&N;

/* create lists of possible duplicated pairs */
* numeric variables;
data dup_n;
    do i=1 to &NN;
        do j=i+1 to &NN;
            output;
        end;
    end;
run;

* character variables;
data dup_c;
    do i=1 to &NC;
        do j=i+1 to &NC;
            output;
        end;
    end;
run;

/* eliminate possible dup pairs */
* read the original data set, one row at a time, and eliminate 
* non-duplicate pairs from the possible list;

%macro random_access;
%do pt=1 %to &N;
    data dup_N2(keep=i j) dup_C2(keep=i j);
        /* read row &pt from the original dataset */
        set a(firstobs=&pt obs=&pt);

        /* process numeric variables */
        array arrN{*} _Numeric_;
        do k1=1 to N1;
            set dup_N point=k1 NOBS=N1;
            if arrN{i}=arrN{j} then output dup_N2;
        end;

        /* process character variables */
        array arrC{*} _character_;
        do k2=1 to N2;
            set dup_C point=k2 NOBS=N2;
            if arrC{i}=arrC{j} then output dup_C2;
        end;
    run;

    /* renew duplicate pairs */
    proc datasets lib=work nolist;
        delete dup_N dup_C;
        change dup_N2=dup_N dup_C2=dup_C; /*rename datasets*/
    quit;
%end;
%mend;
%random_access;
* kill the macro after use;
%SYSMACDELETE random_access /nowarn; 


/* pick the duplicated variables */
* note: the first number of the duplicates will not appear in column j;
proc sort data=dup_N(drop=i) nodupkey;
    by j;
run;

proc sort data=dup_C(drop=i) nodupkey;
    by j;
run;

/* concatenate the names of duplicate variables */
data _NULL_;
    set a;
    array arrN{*} _Numeric_;
    array arrC{*} _character_;
    format dropN dropC $32767.;
    do k1=1 to N1;
        set dup_N(rename=(j=j1)) point=k1 NOBS=N1;
        dropN=catx(' ',dropN,vname(arrN{j1}));
        put dropN=; *check;
    end;
    do k2=1 to N2;
        set dup_C(rename=(j=j2)) point=k2 NOBS=N2;
        dropC=catx(' ',dropC,vname(arrC{j2}));
        put dropC=; *check;
    end;

    str=catx(' ',dropN,dropC);
    call symputx("dropAll",str,"G");
    stop;
run;
%put dropAll=&dropAll; *check;

/* output answer */
data answer(drop=&dropAll);
    set a;
run;

*cleanup;
proc datasets lib=work nolist;
    delete dup_:;
quit;
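
For comparison, the same pair-elimination idea can be sketched in R. This is only an illustration of the algorithm under a few assumptions: the function name `find_duplicate_columns_by_rows` is hypothetical, columns are only compared when they share a class (mirroring the numeric/character split above), and missing values are treated as not equal.

```r
# Hypothetical sketch: start with all column pairs, drop a pair as soon as one row differs.
find_duplicate_columns_by_rows <- function(df) {
  p <- ncol(df)
  if (p < 2) return(character(0))
  pairs <- t(combn(p, 2))                      # candidate pairs (i, j) with i < j
  cls <- vapply(df, function(x) class(x)[1], character(1))
  pairs <- pairs[cls[pairs[, 1]] == cls[pairs[, 2]], , drop = FALSE]
  for (r in seq_len(nrow(df))) {
    if (nrow(pairs) == 0) break                # nothing left to eliminate
    same <- vapply(seq_len(nrow(pairs)), function(k) {
      isTRUE(df[[pairs[k, 1]]][r] == df[[pairs[k, 2]]][r])
    }, logical(1))
    pairs <- pairs[same, , drop = FALSE]       # keep only pairs still equal on row r
  }
  # The first member of each duplicate group never appears as j, so it is kept.
  names(df)[sort(unique(pairs[, 2]))]
}

# Same data as the SAS data set a above.
a <- data.frame(
  x1 = c(1, 2, 3, 4, 1, 2),
  x2 = c(1, 1, 8, 8, 1, 1),
  x3 = c(1, 2, 3, 4, 1, 2),
  x4 = c("a", "b", "b", "k", "t", "t"),
  x5 = rep(1, 6),
  x6 = c("a", "b", "b", "k", "t", "t"),
  x7 = c(1, 2, 3, 4, 1, 2),
  x8 = c("a", "b", "b", "k", "t", "t"),
  stringsAsFactors = FALSE
)

to_drop <- find_duplicate_columns_by_rows(a)
answer <- a[, setdiff(names(a), to_drop), drop = FALSE]
```

On this data `to_drop` is x3, x6, x7, x8, leaving x1, x2, x4, x5. With 800 columns the number of candidate pairs is large, so this row-by-row loop is likely much slower than the `duplicated`-based one-liners; it is meant to show the logic, not to be the fastest option.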

My code may not be the best way to do this; it just shows that it is possible.

Best regards,

Bill