SQL equivalent of an R query

Time: 2014-03-20 15:20:09

Tags: sql r postgresql

I have two datasets, author_data and paper_author.

author_data:

author_id    author_name          author_affiliation
25           William H. Nailon
37           P. B. Littlewood     Cavendish Laboratory|Cambridge University
44           A. Kuroiwa           Department of Molecular Biology

paper_author:

paper_id    author_id    author_name       author_affiliation
1           521630       Ayman Kaheel      Cairo Microsoft Innovation Lab
1           972575       Mahmoud Refaat    Cairo Microsoft Innovation Lab

I ran the following query in R:

author_data[which(author_data$author_id %in% paper_author$author_id &
                  author_data$author_name %in% paper_author$author_name & 
                  author_data$author_affiliation %in% paper_author$author_affiliation), ]

That is, I want to find the matches between author_data and paper_author where the three columns author_id, author_name, and author_affiliation all match.

I wrote a query to get this result in SQL, but I could not get it right. The query I tried is:

statement <- "select
              author_data.author_id,
              author_data.author_name,
              author_data.author_affiliation
        FROM author_data
        INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation"

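With this query I get more rows than there are in author_data, even though the query should return a subset of author_data. I can't figure out what is going wrong, as I am quite new to SQL.

What is wrong with this query?

Thanks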

2 answers:

Answer 0: (score: 1)

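There is a difference between which in R and join in SQL. While which effectively subsets the given data frame, join returns every pair of rows that satisfies the join condition. I am almost sure that in your case some combinations of author_id, author_name, author_affiliation occur more than once in paper_author. As a result, each matching row in author_data is repeated once for every matching row in paper_author.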
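
To make the row multiplication concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of PostgreSQL. The sample rows are made up for illustration (in particular, the three paper_author rows for the same author are an assumption, not data from the question), and JOIN_QUERY is just a name chosen here.

```python
import sqlite3

# In-memory database with hypothetical sample rows (not the asker's real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (
        author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (
        paper_id INTEGER, author_id INTEGER,
        author_name TEXT, author_affiliation TEXT);

    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');

    -- The same author appears on three different papers.
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (3, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# The join from the question, unchanged apart from layout.
JOIN_QUERY = """
    SELECT author_data.author_id,
           author_data.author_name,
           author_data.author_affiliation
    FROM author_data
    INNER JOIN paper_author
      ON author_data.author_id = paper_author.author_id
     AND author_data.author_name = paper_author.author_name
     AND author_data.author_affiliation = paper_author.author_affiliation
"""

author_rows = conn.execute("SELECT COUNT(*) FROM author_data").fetchone()[0]
joined_rows = conn.execute(JOIN_QUERY).fetchall()

# The single matching author comes back once per matching paper_author row,
# so the join result is larger than author_data itself.
print("rows in author_data:", author_rows)
print("rows returned by join:", len(joined_rows))
```

With this sample, author_data has two rows but the join returns three identical rows for author 37, which is the symptom described in the question.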

Your query is almost correct; you need to add distinct or group by, or use exists instead.

Distinct:

select
   distinct
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation

Group by:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation
group by
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation

You can also use exists:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
where
   exists (select 1 from paper_author where
       author_data.author_id = paper_author.author_id
       AND author_data.author_name = paper_author.author_name
       AND author_data.author_affiliation = paper_author.author_affiliation
       )
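
As a rough check that the three variants above agree, the sketch below runs them against a small in-memory SQLite database through Python's sqlite3 module. The sample rows, the table aliases a and p, and the ON_CLAUSE helper string are assumptions made for this example; PostgreSQL itself is not involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (
        author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (
        paper_id INTEGER, author_id INTEGER,
        author_name TEXT, author_affiliation TEXT);

    -- Hypothetical rows: author 37 is on two papers, 44 on one, 25 on none.
    INSERT INTO author_data VALUES
        (25, 'William H. Nailon', ''),
        (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (44, 'A. Kuroiwa', 'Department of Molecular Biology');
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (3, 44, 'A. Kuroiwa', 'Department of Molecular Biology');
""")

# Shared three-column match condition (aliases a and p are illustrative).
ON_CLAUSE = """
    a.author_id = p.author_id
    AND a.author_name = p.author_name
    AND a.author_affiliation = p.author_affiliation
"""
COLUMNS = "a.author_id, a.author_name, a.author_affiliation"

DISTINCT_QUERY = f"""
    SELECT DISTINCT {COLUMNS}
    FROM author_data a INNER JOIN paper_author p ON {ON_CLAUSE}
"""
GROUP_BY_QUERY = f"""
    SELECT {COLUMNS}
    FROM author_data a INNER JOIN paper_author p ON {ON_CLAUSE}
    GROUP BY {COLUMNS}
"""
EXISTS_QUERY = f"""
    SELECT {COLUMNS}
    FROM author_data a
    WHERE EXISTS (SELECT 1 FROM paper_author p WHERE {ON_CLAUSE})
"""

# Sort the results because SQL does not guarantee row order without ORDER BY.
distinct_rows = sorted(conn.execute(DISTINCT_QUERY).fetchall())
group_rows = sorted(conn.execute(GROUP_BY_QUERY).fetchall())
exists_rows = sorted(conn.execute(EXISTS_QUERY).fetchall())

print(distinct_rows == group_rows == exists_rows)
print([row[0] for row in exists_rows])
```

One difference worth keeping in mind: if author_data itself contained duplicate rows, exists would keep them, while distinct and group by would collapse them into one.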

Answer 1: (score: 0)

Try this.

SELECT author_data.author_id, author_data.author_name, author_data.author_affiliation
FROM author_data, paper_author
WHERE author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
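
A quick way to see how this comma-style query behaves is to compare it with the INNER JOIN from the question. The sketch below does that with Python's sqlite3 module on made-up rows; the sample data and the names COMMA_QUERY and INNER_QUERY are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (
        author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (
        paper_id INTEGER, author_id INTEGER,
        author_name TEXT, author_affiliation TEXT);

    -- Hypothetical rows: one author who appears on two papers.
    INSERT INTO author_data VALUES
        (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
    INSERT INTO paper_author VALUES
        (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
        (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# Comma-separated FROM list with the match condition in WHERE.
COMMA_QUERY = """
    SELECT author_data.author_id, author_data.author_name,
           author_data.author_affiliation
    FROM author_data, paper_author
    WHERE author_data.author_id = paper_author.author_id
      AND author_data.author_name = paper_author.author_name
      AND author_data.author_affiliation = paper_author.author_affiliation
"""

# Explicit INNER JOIN with the same condition in ON.
INNER_QUERY = """
    SELECT author_data.author_id, author_data.author_name,
           author_data.author_affiliation
    FROM author_data
    INNER JOIN paper_author
      ON author_data.author_id = paper_author.author_id
     AND author_data.author_name = paper_author.author_name
     AND author_data.author_affiliation = paper_author.author_affiliation
"""

comma_rows = sorted(conn.execute(COMMA_QUERY).fetchall())
inner_rows = sorted(conn.execute(INNER_QUERY).fetchall())

# Both forms return the same rows, including the repeated author.
print(comma_rows == inner_rows)
print(len(comma_rows))
```

On this sample the two forms return the same two rows for a single author, so the comma syntax on its own does not appear to remove the duplicates; it would still need distinct, group by, or exists.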