I have two datasets, author_data and paper_author.
author_data:
author_id author_name author_affiliation
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
44 A. Kuroiwa Department of Molecular Biology
paper_author:
paper_id author_id author_name author_affiliation
1 521630 Ayman Kaheel Cairo Microsoft Innovation Lab
1 972575 Mahmoud Refaat Cairo Microsoft Innovation Lab
I ran the following in R:
author_data[which(author_data$author_id %in% paper_author$author_id &
author_data$author_name %in% paper_author$author_name &
author_data$author_affiliation %in% paper_author$author_affiliation), ]
That is, I want to find the rows of author_data that match paper_author on all three columns: author_id, author_name, and author_affiliation.
I wrote a query to get the same result in SQL, but I could not get it right. The query I tried is:
statement <- "select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
FROM author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation"
This query returns more rows than author_data contains, yet the result should be a subset of author_data. I cannot figure out what went wrong, as I am new to SQL.
What is wrong with this query?
Thanks
Answer 0 (score: 1)
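There is a difference between which in R and join in SQL. While which effectively subsets the given data frame, join returns every pair of rows that satisfies the join condition. I am almost certain that, in your case, some combinations of author_id, author_name, author_affiliation appear more than once in paper_author. As a result, each row in author_data is repeated once for every matching row in paper_author.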
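The row multiplication described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module only as a convenient way to run the SQL; the paper_author rows here are hypothetical (author 37 is assumed to appear on three papers with identical details) and are not taken from the question's data.

```python
import sqlite3

# In-memory database with the two tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_data (
    author_id INTEGER, author_name TEXT, author_affiliation TEXT);
CREATE TABLE paper_author (
    paper_id INTEGER, author_id INTEGER,
    author_name TEXT, author_affiliation TEXT);

INSERT INTO author_data VALUES
  (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
  (44, 'A. Kuroiwa', 'Department of Molecular Biology');

-- Hypothetical rows: the same author listed on three different papers.
INSERT INTO paper_author VALUES
  (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
  (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
  (3, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# The join from the question, with table aliases for brevity.
join_sql = """
SELECT a.author_id, a.author_name, a.author_affiliation
FROM author_data AS a
INNER JOIN paper_author AS p
   ON a.author_id = p.author_id
  AND a.author_name = p.author_name
  AND a.author_affiliation = p.author_affiliation
"""
joined_rows = conn.execute(join_sql).fetchall()
author_count = conn.execute("SELECT COUNT(*) FROM author_data").fetchone()[0]

# One author_data row matched three paper_author rows, so it comes back
# three times and the result is larger than author_data itself.
print("rows from join:", len(joined_rows))
print("rows in author_data:", author_count)
```

With this data the join yields more rows than author_data holds, even though only one distinct author matched.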
Your query is almost correct; you need to add distinct or group by, or use exists instead:
Distinct:
select
distinct
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
Group by:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
group by
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
You can also use exists:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
where
exists (select 1 from paper_author where
author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
)
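As a rough check, the three variants can be run side by side on a toy database. This is a sketch using Python's built-in sqlite3 module; the paper_author rows are hypothetical duplicates made up for illustration, and the aliases a and p and the helper names cols, match and queries are not from the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_data (
    author_id INTEGER, author_name TEXT, author_affiliation TEXT);
CREATE TABLE paper_author (
    paper_id INTEGER, author_id INTEGER,
    author_name TEXT, author_affiliation TEXT);

INSERT INTO author_data VALUES
  (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
  (44, 'A. Kuroiwa', 'Department of Molecular Biology');

-- Hypothetical rows: author 37 appears on two papers.
INSERT INTO paper_author VALUES
  (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
  (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

cols = "a.author_id, a.author_name, a.author_affiliation"
match = """a.author_id = p.author_id
       AND a.author_name = p.author_name
       AND a.author_affiliation = p.author_affiliation"""

queries = {
    # Join, then collapse identical result rows.
    "distinct": f"""SELECT DISTINCT {cols}
                    FROM author_data AS a
                    INNER JOIN paper_author AS p ON {match}""",
    # Join, then group on every selected column.
    "group by": f"""SELECT {cols}
                    FROM author_data AS a
                    INNER JOIN paper_author AS p ON {match}
                    GROUP BY {cols}""",
    # No join: keep an author_data row if at least one match exists.
    "exists": f"""SELECT {cols}
                  FROM author_data AS a
                  WHERE EXISTS (SELECT 1 FROM paper_author AS p
                                WHERE {match})""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, len(rows), rows)
```

On this data all three return the single matching author once. One difference worth keeping in mind: exists never joins, so it returns author_data rows exactly as stored, while distinct and group by would also collapse any rows that are duplicated within author_data itself.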
Answer 1 (score: 0)
Try this.
SELECT author_data.author_id,author_data.author_name,author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id
AND author_data.author_name=paper_author.author_name
AND author_data.author_affiliation=paper_author.author_affiliation