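Can SPSS commands (e.g., MATCH FILES) be used to perform a left outer join between 2 SPSS datasets? Assume the join field is not unique in either dataset.
Example: Let Dataset1 (the left one) contain 2 fields, ClassNbr and Fact1, and these 4 records: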
1 A
1 D
2 A
3 B
Let Dataset2 contain 2 fields, ClassNbr and Fact2, and these 3 records:
1 XX
1 XY
3 ZZ
I want to join Dataset1 and Dataset2 on ClassNbr. The desired result is the following 6-record dataset:
1 A XX
1 A XY
1 D XX
1 D XY
2 A (NULL)
3 B ZZ
I would prefer a solution that uses SPSS commands (rather than SQL, Python, etc.).
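To check the output of any SPSS solution, the join semantics asked for above can be pinned down in a few lines of Python. This is only a reference sketch of the desired result, not an SPSS solution; the function name `left_outer_join` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def left_outer_join(left, right):
    """Left outer join of (key, value) rows; keys may repeat on both sides."""
    # Group the right-hand values by key so repeated keys are all kept.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    result = []
    for key, left_value in left:
        matches = right_by_key.get(key)
        if matches:
            # One output row per matching right-hand row.
            for right_value in matches:
                result.append((key, left_value, right_value))
        else:
            # No match: keep the left row and leave the right side empty.
            result.append((key, left_value, None))
    return result


dataset1 = [(1, "A"), (1, "D"), (2, "A"), (3, "B")]
dataset2 = [(1, "XX"), (1, "XY"), (3, "ZZ")]

joined = left_outer_join(dataset1, dataset2)
for row in joined:
    print(row)
```

With the example data this yields the six rows listed in the question, with `None` standing in for (NULL).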
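Answer 0 (score: 2)
As far as I know, you cannot do this directly. One possible workaround is to reshape the data from long to wide format (using casestovars), perform the merge, and then reshape back to long format (using varstocases). Below is a worked example (ask if any part of the code needs clarification).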
data list free / ClassNbr (F1) Fact1 (A1).
begin data
1 A
1 D
2 A
3 B
end data.
dataset name data1.
casestovars
/id = ClassNbr.
data list free / ClassNbr (F1) Fact2 (A2).
begin data
1 XX
1 XY
3 ZZ
end data.
dataset name data2.
casestovars
/id = ClassNbr.
match files file = 'data1'
/file = 'data2'
/by ClassNbr.
execute.
varstocases
/make Fact1 FROM Fact1.1 to Fact1.2
/null = KEEP.
varstocases
/make Fact2 FROM Fact2.1 to Fact2.2
/null = KEEP.
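This creates some cases you do not want, so here I have defined a set of commands that identify those cases and remove them (I am sure this could be made more efficient).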
*now cleaning up the extra records.
compute flag = 0.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = lag(Fact2) flag = 1.
select if flag = 0.
execute.
if Fact1 = " " and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact2 = lag(Fact2) and Fact1 = " " flag = 1.
select if flag = 0.
execute.
I am sure this could be made more robust (perhaps with some custom Python functions), but hopefully this helps you get started.
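To see why the wide-then-long round trip produces extra records, the reshaping can be mimicked in plain Python. This is a rough sketch under assumptions: the helper names `wide` and `pad` are hypothetical, the row order differs from what varstocases produces, and the cleanup rule shown here is a simplified stand-in for the lag-based commands above, not the same logic.

```python
def wide(rows):
    """Collect all values per key, like a long-to-wide reshape."""
    out = {}
    for key, value in rows:
        out.setdefault(key, []).append(value)
    return out


def pad(values, width):
    """Pad a value list with None up to the wide-format column count."""
    return values + [None] * (width - len(values))


data1 = [(1, "A"), (1, "D"), (2, "A"), (3, "B")]
data2 = [(1, "XX"), (1, "XY"), (3, "ZZ")]

w1, w2 = wide(data1), wide(data2)
n1 = max(len(v) for v in w1.values())  # number of Fact1.* columns
n2 = max(len(v) for v in w2.values())  # number of Fact2.* columns

# Going back to long format with nulls kept expands every key to n1 * n2 rows.
expanded = [
    (key, f1, f2)
    for key in sorted(w1)
    for f1 in pad(w1[key], n1)
    for f2 in pad(w2.get(key, []), n2)
]

# Simplified cleanup (an assumption): drop padding on the left side, drop
# padding on the right side unless the key has no right-hand rows at all,
# then remove exact duplicates while preserving order.
kept = [
    (k, f1, f2)
    for k, f1, f2 in expanded
    if f1 is not None and (f2 is not None or k not in w2)
]
cleaned = list(dict.fromkeys(kept))

print(len(expanded), len(cleaned))
```

Note that removing exact duplicates would also collapse genuinely duplicated input rows, which is one reason a cleanup of this kind is fragile.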
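Answer 1 (score: 2)
You can do this if you install the "STATS CARTPROD" extension bundle. With this extension you can create a Cartesian product as an intermediate step toward the outer join.
As of SPSS 22, you can download it directly from the program menu: Extra -> Extension Bundles -> Install and Download Extension Bundles. You can also download and install it manually from here: https://www.ibm.com/developerworks/community/files/app?lang=en#/file/d0afcd4e-6d5d-4779-84ef-2b68bc81b861 Note that "Python Essentials for SPSS" must already be installed for it to work.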
*** create the example data.
DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.
DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
END DATA.
DATASET NAME data2.
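When using the "STATS CARTPROD" extension, I ran into problems with uppercase letters in variable names. It is also important that "classnbr" has a different variable name in each of the two datasets.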
*** create cartesian product using the STATS CARTPROD extension.
DATASET ACTIVATE data1.
STATS CARTPROD INPUT2=data2
VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = cart.
EXECUTE.
*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.
DELETE VARIABLES classnbr2.
Now include the cases from data1 that have no match in data2.
*** create left outer join.
* assuming both data sets are ordered by classnbr1 and fact1.
ADD FILES
/FILE = cart
/FILE = data1
/BY classnbr1 fact1.
EXECUTE.
DATASET NAME outer_join.
DATASET ACTIVATE outer_join.
COMPUTE select=1.
IF (length(fact2)=0 AND classnbr1=LAG(classnbr1) AND fact1=LAG(fact1)) select=0.
EXECUTE.
SELECT IF select = 1.
EXECUTE.
DELETE VARIABLES select.
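The three steps above (Cartesian product, equi join, re-adding unmatched left cases) can be sketched in Python to make the logic explicit. This is a minimal illustration, not a description of how the STATS CARTPROD extension works internally; the function name `left_join_via_cartesian` is hypothetical.

```python
from itertools import product


def left_join_via_cartesian(left, right):
    """Left outer join built from a Cartesian product, as in the SPSS steps."""
    # Step 1: pair every left row with every right row.
    cart = [(lk, lv, rk, rv) for (lk, lv), (rk, rv) in product(left, right)]

    # Step 2: equi join, keeping only pairs whose keys agree.
    equi = [(lk, lv, rv) for lk, lv, rk, rv in cart if lk == rk]

    # Step 3: add back the left rows that found no partner.
    matched = {(lk, lv) for lk, lv, _ in equi}
    unmatched = [(lk, lv, None) for lk, lv in left if (lk, lv) not in matched]

    # Interleave by key and left-hand value; the sort is stable, so the
    # right-hand order within each group is preserved.
    rows = sorted(equi + unmatched, key=lambda r: (r[0], r[1]))
    return cart, rows


data1 = [(1, "A"), (1, "D"), (2, "A"), (3, "B")]
data2 = [(1, "XX"), (1, "XY"), (3, "ZZ")]

cart, outer_join = left_join_via_cartesian(data1, data2)
print(len(cart), "product rows ->", len(outer_join), "joined rows")
```

The intermediate product has one row per pair of input rows, which is what makes this approach expensive for large inputs.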
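However, you may run into trouble with very large datasets, because the Cartesian product will be huge.
To mitigate this somewhat, before generating the Cartesian product you can remove from each dataset all cases that have no counterpart in the other dataset.
Here is how to do that: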
*** create the example data.
*** (I added an additional case to the second data set, which will be deleted
in the result, since it has no match in the first data set).
DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.
DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
4 XY
END DATA.
DATASET NAME data2.
*** select cases who (don't) have a matching correspondent in the other dataset
** Create a list of unique key values of data set data2
** (In this Example the key Value is classnbr2).
DATASET ACTIVATE data2.
DATASET COPY data2_keylist.
DATASET ACTIVATE data2_keylist.
* Assuming the data set is already sorted by the key value.
* Mark the first occurrence of every key value in the data set.
COMPUTE list = 1.
IF classnbr2 = LAG(classnbr2) list = 0.
SELECT IF list=1.
EXECUTE.
* Delete all variables except the (now unique) key value.
MATCH FILES
/FILE *
/KEEP classnbr2.
EXECUTE.
** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data1.
MATCH FILES
/FILE *
/TABLE data2_keylist
/RENAME classnbr2=classnbr1
/IN data2
/BY classnbr1.
EXECUTE.
** Remove cases from data1 who don't have a correspondent in data2
** and store them in another dataset, because we need to add them later.
DATASET COPY date1_nomatch.
SELECT IF data2=1.
EXECUTE.
DATASET ACTIVATE date1_nomatch.
SELECT IF data2=0.
EXECUTE.
** Now doing the same for the other data set.
** Create a list of unique key values of data set data1
** (In this Example the key Value is classnbr1).
DATASET ACTIVATE data1.
DATASET COPY data1_keylist.
DATASET ACTIVATE data1_keylist.
* Assuming the data set is already sorted by the key value.
* Mark the first occurrence of every key value in the data set.
COMPUTE list = 1.
IF classnbr1 = LAG(classnbr1) list = 0.
SELECT IF list=1.
EXECUTE.
* Delete all variables except the (now unique) key value.
MATCH FILES
/FILE *
/KEEP classnbr1.
EXECUTE.
** Match the list of data1 key values to data2 in order to mark
** which cases of data2 have at least one correspondent case in data1.
DATASET ACTIVATE data2.
MATCH FILES
/FILE *
/TABLE data1_keylist
/RENAME classnbr1=classnbr2
/IN data1
/BY classnbr2.
EXECUTE.
** Remove cases from data2 who don't have a correspondent in data1.
SELECT IF data1=1.
EXECUTE.
*** create a cartesian product of the two reduced datasets.
DATASET ACTIVATE data1.
STATS CARTPROD INPUT2=data2
VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = outer_join.
EXECUTE.
*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.
DELETE VARIABLES classnbr2.
*** create left outer join by adding the cases from date1_nomatch.
DATASET ACTIVATE outer_join.
ADD FILES
/FILE = *
/FILE = date1_nomatch
/BY classnbr1 fact1
/DROP data2.
EXECUTE.
* Some cleaning up.
DATASET CLOSE data1_keylist.
DATASET CLOSE date1_nomatch.
DATASET CLOSE data2_keylist.
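The effect of this pre-filtering on the size of the Cartesian product can be illustrated with a small Python sketch. The function name `reduce_before_product` is hypothetical, and the sketch only mirrors the idea of the key-list matching above, not the SPSS commands themselves.

```python
def reduce_before_product(left, right):
    """Split off rows that cannot match before building the product."""
    left_keys = {key for key, _ in left}
    right_keys = {key for key, _ in right}

    # Left rows with a partner go into the product; the rest are set aside
    # and appended to the result later with an empty right-hand side.
    left_matched = [(k, v) for k, v in left if k in right_keys]
    left_nomatch = [(k, v) for k, v in left if k not in right_keys]

    # Right rows without a partner can never appear in a left outer join.
    right_matched = [(k, v) for k, v in right if k in left_keys]
    return left_matched, left_nomatch, right_matched


data1 = [(1, "A"), (1, "D"), (2, "A"), (3, "B")]
data2 = [(1, "XX"), (1, "XY"), (3, "ZZ"), (4, "XY")]

left_matched, left_nomatch, right_matched = reduce_before_product(data1, data2)

full_size = len(data1) * len(data2)
reduced_size = len(left_matched) * len(right_matched)
print(full_size, "rows without filtering,", reduced_size, "rows with filtering")
```

The saving grows with the share of keys that appear in only one dataset; if every key appears in both, the filtering does not shrink the product at all.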