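I have ~2,300 CSV files, and the variable name in column 1 is different in each file. I want to merge all the files by panelistID (column 2) and run frequencies on column 1 to get the frequencies for each CSV file. Can someone please help?
Sample file layout below: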
File1
mat1_pen1, panelistID
0, 10075001
20, 10086001
44, 10086002
10, 10096001
File2
mat2_pen2, panelistID
74, 10118002
40, 10118003
77, 10128001
77, 10128003
File3
mat3_pen4, panelistID
77, 10128003
51, 10137001
0, 10148001
0, 10148002
0, 10157001
Answer 0 (score: 4)
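Simply use a wildcard in the infile statement to read all the files, and use the filename= option to store the current file name in the temporary variable _f, then copy it into f. Then manipulate f and var accordingly.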
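data big ;
  length _f f $256 ;
  infile "*.csv" truncover filename=_f dlm=',' ;
  * ?? suppresses the invalid-data notes raised by each header row ;
  input var ?? panelistID ?? ;
  * _f is a temporary variable, so copy it into f after the read ;
  f = _f ;
  * header rows leave panelistID missing, so drop them ;
  if missing(panelistID) then delete ;
run ;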
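For the frequency part of the question, one possible follow-up is the sketch below. It assumes the data set big from the step above, with f holding the source file name and var holding the column 1 value. Sorting by f is my addition so that PROC FREQ reports counts and percentages separately for each file.

```sas
/* Sketch only: assumes BIG has f (source file name) and var (column 1 value). */
proc sort data=big ;
  by f ;
run ;

/* One frequency table of var per source file. */
proc freq data=big ;
  by f ;
  tables var / missing ;
run ;
```

With ~2,300 files this prints one table per file, so routing the results to a data set with the out= option of the tables statement may be more practical than reading the listing.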
Answer 1 (score: 1)
filename mycsv "*.csv";
data mydataset(drop=tmp);
  infile mycsv dsd dlm=',' eov=eov;
  retain mat_pen_id;
  input @; *hold the current line so that eov is set before it is tested;
  if _n_ = 1 or eov then do; *when using wildcard-concatenated input files, ;
    input mat_pen_id :$20. tmp :$20.; *eov is true for first line of second file.;
    eov = 0;
    delete; *do not output the header row itself;
  end;
  else do; * _n_ =1 is true for first line of first file only;
    input mat_pen panelistID;
  end;
run;
proc sort data= mydataset;
by panelistID;
run;
proc transpose data=mydataset out=wide_data;
by panelistID;
id mat_pen_id;
var mat_pen;
run;
proc print data=wide_data;
run;
This gives you a data set named wide_data, like:
obs panelistID mat1_pen1 mat2_pen2 mat3_pen3 etc
1 10075001 0 22 33
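To get the frequencies the question asks for from this layout, a sketch along the following lines may work. It assumes that every column 1 name starts with mat, as in the sample files, so that the mat: prefix list picks up all of the transposed columns. The missing option is my addition, so that panelists absent from a given file are counted rather than silently dropped.

```sas
/* Sketch only: assumes WIDE_DATA from the PROC TRANSPOSE step above and */
/* that all transposed column names begin with "mat".                    */
proc freq data=wide_data ;
  tables mat: / missing ;
run ;
```

Each transposed column yields its own one-way table, which corresponds to one table per original CSV file.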