I have a data frame like this:
1 1 1 a 1 a
2 1 2 b 1 b
3 8 3 b 1 b
4 8 4 k 1 k
1 1 1 t 1 t
2 1 2 t 1 t
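I want to remove duplicate columns, that is, columns holding exactly the same values. Column 3 is a copy of column 1, so I want to drop either column 3 or column 1; column 6 is a copy of column 4, so I want to drop either column 6 or column 4. My data is very large: it has 800 columns, with names like a1, a2, a3, ..., a800.
So my result would look like this: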
1 1 a 1
2 1 b 1
3 8 b 1
4 8 k 1
1 1 t 1
2 1 t 1
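It would be great if someone could help me with this task.
Thanks for your replies. I will try this code; it would be great to get equivalents in both SAS and R.
Answer 0 (score: 3)
Perhaps one of these will work for you.
My guess is that the paste approach is faster, but calling duplicated on the transposed data.frame is probably more reliable: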
Dups1 <- duplicated(lapply(mydf, paste, collapse = ""))
Dups2 <- duplicated(t(mydf))
Dups1
# [1] FALSE FALSE TRUE FALSE FALSE TRUE
Dups2
# [1] FALSE FALSE TRUE FALSE FALSE TRUE
mydf[!Dups1]
#   V1 V2 V4 V5
# 1  1  1  a  1
# 2  2  1  b  1
# 3  3  8  b  1
# 4  4  8  k  1
# 5  1  1  t  1
# 6  2  1  t  1
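One caveat about the paste approach, shown as a minimal sketch on made-up data (the data frame df and its columns a and b are illustrative, not taken from the question): with collapse = "", two different columns can collapse to the same string and be flagged as duplicates.

```r
# Hypothetical data: the columns differ, but both paste to "111".
df <- data.frame(a = c(1, 11), b = c(11, 1))

dups_paste <- duplicated(lapply(df, paste, collapse = ""))    # no separator
dups_sep   <- duplicated(lapply(df, paste, collapse = "\r"))  # with separator
dups_t     <- duplicated(t(df))                               # transposed data

dups_paste  # column b is wrongly flagged as a duplicate of a
dups_sep    # neither column is flagged
dups_t      # neither column is flagged
```

Using a separator that cannot occur in the data avoids this particular collision.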
Answer 1 (score: 3)
You can use
dat[!duplicated(unclass(dat))]
#   V1 V2 V4 V5
# 1  1  1  a  1
# 2  2  1  b  1
# 3  3  8  b  1
# 4  4  8  k  1
# 5  1  1  t  1
# 6  2  1  t  1
where dat is the name of your data frame.
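For anyone who wants to run this, here is a self-contained sketch that rebuilds the sample data from the question. The column names V1 to V6 are assumed from the output above, and the numeric columns are assumed to be stored as doubles. unclass(dat) turns the data frame into a plain list of columns, so duplicated() compares whole columns with each other.

```r
# Sample data from the question; names V1..V6 are an assumption.
dat <- data.frame(
  V1 = c(1, 2, 3, 4, 1, 2),
  V2 = c(1, 1, 8, 8, 1, 1),
  V3 = c(1, 2, 3, 4, 1, 2),
  V4 = c("a", "b", "b", "k", "t", "t"),
  V5 = c(1, 1, 1, 1, 1, 1),
  V6 = c("a", "b", "b", "k", "t", "t"),
  stringsAsFactors = FALSE
)

keep <- !duplicated(unclass(dat))  # TRUE for the first copy of each column
res  <- dat[keep]
names(res)

# Columns are compared as whole objects, so an integer column and a
# double column with the same values are not treated as duplicates.
mixed <- duplicated(list(1:2, c(1, 2)))
```

If the real data mixes integer and double storage, it may be worth converting the columns to a common type first.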
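Answer 2 (score: 0)
The SAS code below should give you the answer you want. The algorithm is simple: start from every possible pair of columns, then read the original data one row at a time and eliminate the pairs that turn out not to be duplicates. The main techniques I used are:
(1) the point= and nobs= options of the set statement, to perform random access on a dataset
(2) handling character and numeric variables separately, because of the constraints of SAS arrays
(3) call symputx() inside a data _NULL_ step, to store dataset attributes in macro variables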
/* original dataset */
*only x1,x2,x4,x5 are non-duplicate;
data a;
input x1 x2 x3 x4 $ x5 x6 $ x7 x8 $;
cards;
1 1 1 a 1 a 1 a
2 1 2 b 1 b 2 b
3 8 3 b 1 b 3 b
4 8 4 k 1 k 4 k
1 1 1 t 1 t 1 t
2 1 2 t 1 t 2 t
;run;
/* get the number of columns (character/numeric, respectively) */
data _NULL_;
  set a(firstobs=1 obs=1);
  /* # of numeric vars */
  array arrN{*} _Numeric_;
  x=dim(arrN);
  call symputx("NN",x,"G");
  /* # of char vars */
  array arrC{*} _Character_;
  y=dim(arrC);
  call symputx("NC",y,"G");
  /* #obs of the input dataset */
  dsid=open("a");
  z=attrn(dsid,"NOBS");
  call symputx("N",z,"G");
  w=close(dsid);
run;
/* check (placed after run; so the macro variables exist when %put resolves them) */
%put NN=&NN, NC=&NC, N=&N;
/* create lists of possible duplicated pairs */
* numeric variables;
data dup_n;
  do i=1 to &NN;
    do j=i+1 to &NN;
      output;
    end;
  end;
run;
* character variables;
data dup_c;
  do i=1 to &NC;
    do j=i+1 to &NC;
      output;
    end;
  end;
run;
/* eliminate possible dup pairs */
* read the original data set, one row at a time, and eliminate
* non-duplicate pairs from the possible list;
%macro random_access;
  %do pt=1 %to &N;
    data dup_N2(keep=i j) dup_C2(keep=i j);
      /* read row &pt from the original dataset */
      set a(firstobs=&pt obs=&pt);
      /* process numeric variables */
      array arrN{*} _Numeric_;
      do k1=1 to N1;
        set dup_N point=k1 NOBS=N1;
        if arrN{i}=arrN{j} then output dup_N2;
      end;
      /* process character variables */
      array arrC{*} _character_;
      do k2=1 to N2;
        set dup_C point=k2 NOBS=N2;
        if arrC{i}=arrC{j} then output dup_C2;
      end;
    run;
    /* renew duplicate pairs */
    proc datasets lib=work nolist;
      delete dup_N dup_C;
      change dup_N2=dup_N dup_C2=dup_C; /*rename datasets*/
    quit;
  %end;
%mend;
%random_access;
* kill the macro after use;
%SYSMACDELETE random_access /nowarn;
/* pick the duplicated rows */
* note: the first number of the duplicates will not appear in column j;
proc sort data=dup_N(drop=i) nodupkey;
  by j;
run;
proc sort data=dup_C(drop=i) nodupkey;
  by j;
run;
/* concatenate the names of duplicate variables */
data _NULL_;
  set a;
  array arrN{*} _Numeric_;
  array arrC{*} _character_;
  format dropN dropC $32767.;
  do k1=1 to N1;
    set dup_N(rename=(j=j1)) point=k1 NOBS=N1;
    dropN=catx(' ',dropN,vname(arrN{j1}));
    put dropN=; *check;
  end;
  do k2=1 to N2;
    set dup_C(rename=(j=j2)) point=k2 NOBS=N2;
    dropC=catx(' ',dropC,vname(arrC{j2}));
    put dropC=; *check;
  end;
  str=catx(' ',dropN,dropC);
  call symputx("dropAll",str,"G");
  stop;
run;
%put dropAll=&dropAll; *check;
/* output answer */
data answer(drop=&dropAll);
  set a;
run;
*cleanup;
proc datasets lib=work nolist;
  delete dup_:;
quit;
My code may not be the best way to do this; it just shows that it can be done.
Best regards,
Bill
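Since the question also asks about R, the same pair-elimination idea can be sketched there for comparison. This is a rough illustration and not a translation of the SAS program: the helper name find_dup_cols is hypothetical, and it compares columns of the same class rather than keeping separate numeric and character lists.

```r
# Hypothetical helper: names of columns that duplicate an earlier column,
# found by eliminating candidate pairs one row at a time.
find_dup_cols <- function(df) {
  p <- ncol(df)
  if (p < 2L) return(character(0))
  # Start with every pair (i, j), i < j, as a candidate duplicate.
  pairs <- t(combn(p, 2L))
  # Columns of different classes cannot be duplicates of each other.
  same_class <- vapply(seq_len(nrow(pairs)), function(k) {
    identical(class(df[[pairs[k, 1L]]]), class(df[[pairs[k, 2L]]]))
  }, logical(1))
  pairs <- pairs[same_class, , drop = FALSE]
  # Read one row at a time; keep only the pairs that still agree.
  for (r in seq_len(nrow(df))) {
    if (nrow(pairs) == 0L) break
    agree <- vapply(seq_len(nrow(pairs)), function(k) {
      identical(df[[pairs[k, 1L]]][r], df[[pairs[k, 2L]]][r])
    }, logical(1))
    pairs <- pairs[agree, , drop = FALSE]
  }
  # The second member of each surviving pair copies an earlier column.
  names(df)[sort(unique(pairs[, 2L]))]
}

# Same sample data as the SAS example above.
a <- data.frame(
  x1 = c(1, 2, 3, 4, 1, 2), x2 = c(1, 1, 8, 8, 1, 1),
  x3 = c(1, 2, 3, 4, 1, 2), x4 = c("a", "b", "b", "k", "t", "t"),
  x5 = c(1, 1, 1, 1, 1, 1), x6 = c("a", "b", "b", "k", "t", "t"),
  x7 = c(1, 2, 3, 4, 1, 2), x8 = c("a", "b", "b", "k", "t", "t"),
  stringsAsFactors = FALSE
)
drop_cols <- find_dup_cols(a)
answer <- a[setdiff(names(a), drop_cols)]
```

With 800 columns there are about 320,000 candidate pairs to begin with, so the one-line duplicated() approaches from the other answers are likely the more practical choice in R.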