Getting the most common value in a column using SQLite in R

Time: 2016-07-31 11:45:26

Tags: sql r sqlite

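I have 2 tables. The first is called Athletes and has the columns:

athlete_id, first_name, last_name, country_id

The second table is called Medals and has the columns:

year, sport, event, athlete_id, medal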

Here is part of Athletes:

 athlete_id first_name last_name country_id
1 BURKETOM01        Tom     Burke        USA
2 HOFMAFRI01      Fritz   Hofmann        DEU
3  LANEFRA01    Francis      Lane        USA
4 SZOKOALA01     Alajos  Szokolyi        HUN
5 BURKETOM01        Tom     Burke        USA
6 JAMISHER01    Herbert   Jamison        USA

Here is part of Medals:

  year         sport    event athlete_id  medal
1 1896 Track & Field 100m Men BURKETOM01   GOLD
2 1896 Track & Field 100m Men HOFMAFRI01 SILVER
3 1896 Track & Field 100m Men  LANEFRA01 BRONZE
4 1896 Track & Field 100m Men SZOKOALA01 BRONZE
5 1896 Track & Field 400m Men BURKETOM01   GOLD
6 1896 Track & Field 400m Men JAMISHER01 SILVER

I need to find the athlete with the most medals and the number of medals that athlete won. The correct answer is

 full_name top_no_medals
1 Larisa Latynina 18

I saw a few posts similar to this one and tried what was suggested there. Here is my code:

dbGetQuery(olympics.db,statement = 
             'SELECT Athletes.first_name ||" "|| Athletes.last_name AS full_name,COUNT(Medals.athlete_id) 
           AS top_no_medals
           FROM Athletes
           JOIN  Medals ON Medals.athlete_id=Athletes.athlete_id
           GROUP BY full_name
           ORDER BY COUNT(*) DESC
           LIMIT 1'
)

What I am trying to do is join the 2 tables on athlete_id and count that variable. For some reason, the answer I get has far too many medals.

          full_name top_no_medals
1   Larisa Latynina           324

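It seems that something goes wrong when the medals are counted. I am sure the data is correct: when I check the Medals table to see which athlete_id is the most common, I get the right athlete. Here is the code and the answer: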

dbGetQuery(olympics.db,statement = ' Select athlete_id, count(*) AS top_no_medals
 From Medals
Group By athlete_id
ORDER BY COUNT(*) DESC
LIMIT 1')

 athlete_id top_no_medals
1 LATYNLAR01            18

This athlete_id belongs to Larisa Latynina, so the problem is not in the Medals table.

2 answers:

Answer 0: (score: 2)

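Given that your second query works, the problem appears to be the column athletes(athlete_id): it seems to contain duplicates, so each medal row matches more than one athlete row in the join.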

Try running this query:

select athlete_id, count(*) as cnt
from athletes
group by athlete_id
order by count(*) desc;

The largest cnt should be 1. If it is not, you have a problem in your data.
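To make the duplicate check concrete, here is a minimal sketch that loads the six sample rows from the question into an in-memory database and runs both the diagnostic query and the original join. It uses Python's standard sqlite3 module only so the example is self-contained; the SQL strings are the part that matters and could equally be passed to dbGetQuery in R. The column types are assumptions, since the question does not show the schema.

```python
import sqlite3

# In-memory database; column types are assumed, the question shows only names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Athletes (athlete_id TEXT, first_name TEXT, last_name TEXT, country_id TEXT);
    CREATE TABLE Medals (year INTEGER, sport TEXT, event TEXT, athlete_id TEXT, medal TEXT);
""")

# Sample rows copied from the question; note that BURKETOM01 appears twice.
athletes = [
    ("BURKETOM01", "Tom", "Burke", "USA"),
    ("HOFMAFRI01", "Fritz", "Hofmann", "DEU"),
    ("LANEFRA01", "Francis", "Lane", "USA"),
    ("SZOKOALA01", "Alajos", "Szokolyi", "HUN"),
    ("BURKETOM01", "Tom", "Burke", "USA"),
    ("JAMISHER01", "Herbert", "Jamison", "USA"),
]
medals = [
    (1896, "Track & Field", "100m Men", "BURKETOM01", "GOLD"),
    (1896, "Track & Field", "100m Men", "HOFMAFRI01", "SILVER"),
    (1896, "Track & Field", "100m Men", "LANEFRA01", "BRONZE"),
    (1896, "Track & Field", "100m Men", "SZOKOALA01", "BRONZE"),
    (1896, "Track & Field", "400m Men", "BURKETOM01", "GOLD"),
    (1896, "Track & Field", "400m Men", "JAMISHER01", "SILVER"),
]
conn.executemany("INSERT INTO Athletes VALUES (?, ?, ?, ?)", athletes)
conn.executemany("INSERT INTO Medals VALUES (?, ?, ?, ?, ?)", medals)

# Diagnostic: how often does the most frequent athlete_id occur in Athletes?
dup_check = conn.execute("""
    SELECT athlete_id, COUNT(*) AS cnt
    FROM Athletes
    GROUP BY athlete_id
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()

# The question's join: every medal row matches every duplicate athlete row.
inflated = conn.execute("""
    SELECT a.first_name || ' ' || a.last_name AS full_name,
           COUNT(m.athlete_id) AS top_no_medals
    FROM Athletes a
    JOIN Medals m ON m.athlete_id = a.athlete_id
    GROUP BY full_name
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()

print(dup_check)  # ('BURKETOM01', 2)
print(inflated)   # ('Tom Burke', 4): 2 athlete rows x 2 medals
```

On this sample the join reports 4 medals for an athlete who has 2. The same effect would account for the figure in the question: 324 is 18 × 18, which is what the join produces if the athlete's row occurs 18 times in Athletes.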

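Oh, never mind: your Athletes table does have duplicates (BURKETOM01 appears twice in your sample). Assuming the name is always the same for a given athlete_id, you can do this: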

SELECT a.full_name, COUNT(*) AS top_no_medals
FROM (SELECT a.athlete_id,
             MAX(a.first_name || ' ' || a.last_name) AS full_name
      FROM Athletes a
      GROUP BY a.athlete_id
     ) a JOIN
     Medals m
     ON m.athlete_id = a.athlete_id
GROUP BY full_name
ORDER BY COUNT(*) DESC
LIMIT 1;
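As a quick check of the deduplicating subquery, the sketch below runs it against a small in-memory database containing one deliberately duplicated athlete row. The data is a subset of the sample in the question, the column types are assumptions, and Python's sqlite3 module is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Athletes (athlete_id TEXT, first_name TEXT, last_name TEXT, country_id TEXT);
    CREATE TABLE Medals (year INTEGER, sport TEXT, event TEXT, athlete_id TEXT, medal TEXT);
    INSERT INTO Athletes VALUES
        ('BURKETOM01', 'Tom', 'Burke', 'USA'),
        ('HOFMAFRI01', 'Fritz', 'Hofmann', 'DEU'),
        ('BURKETOM01', 'Tom', 'Burke', 'USA');  -- duplicate row
    INSERT INTO Medals VALUES
        (1896, 'Track & Field', '100m Men', 'BURKETOM01', 'GOLD'),
        (1896, 'Track & Field', '100m Men', 'HOFMAFRI01', 'SILVER'),
        (1896, 'Track & Field', '400m Men', 'BURKETOM01', 'GOLD');
""")

# Collapse Athletes to one row per athlete_id before joining, so each
# medal row matches exactly one athlete row.
deduplicated = conn.execute("""
    SELECT a.full_name, COUNT(*) AS top_no_medals
    FROM (SELECT athlete_id,
                 MAX(first_name || ' ' || last_name) AS full_name
          FROM Athletes
          GROUP BY athlete_id) a
    JOIN Medals m ON m.athlete_id = a.athlete_id
    GROUP BY a.full_name
    ORDER BY COUNT(*) DESC
    LIMIT 1
""").fetchone()

print(deduplicated)  # ('Tom Burke', 2)
```

The MAX() here only picks one name per athlete_id; it relies on the stated assumption that all duplicates of an id carry the same name.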

You can also modify your query to use COUNT(DISTINCT), assuming each medal has a unique identifier:

SELECT a.first_name || ' ' || a.last_name AS full_name,
       COUNT(DISTINCT m.medal_id) as top_no_medals
FROM Athletes a JOIN
     Medals m
     ON m.athlete_id = a.athlete_id
GROUP BY full_name
ORDER BY COUNT(DISTINCT m.medal_id) DESC
LIMIT 1;
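The Medals table shown in the question has no medal_id column, so the sketch below adds one as a hypothetical INTEGER PRIMARY KEY purely to illustrate the COUNT(DISTINCT) approach. The data is a subset of the question's sample with one duplicated athlete row, and Python's sqlite3 module is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Athletes (athlete_id TEXT, first_name TEXT, last_name TEXT, country_id TEXT);
    -- medal_id is a hypothetical column; it is not in the question's table.
    CREATE TABLE Medals (medal_id INTEGER PRIMARY KEY, year INTEGER, sport TEXT,
                         event TEXT, athlete_id TEXT, medal TEXT);
    INSERT INTO Athletes VALUES
        ('BURKETOM01', 'Tom', 'Burke', 'USA'),
        ('HOFMAFRI01', 'Fritz', 'Hofmann', 'DEU'),
        ('BURKETOM01', 'Tom', 'Burke', 'USA');  -- duplicate row
    INSERT INTO Medals (year, sport, event, athlete_id, medal) VALUES
        (1896, 'Track & Field', '100m Men', 'BURKETOM01', 'GOLD'),
        (1896, 'Track & Field', '100m Men', 'HOFMAFRI01', 'SILVER'),
        (1896, 'Track & Field', '400m Men', 'BURKETOM01', 'GOLD');
""")

# The join still fans out over the duplicate athlete rows, but each medal
# is counted once because only distinct medal_id values are counted.
distinct_count = conn.execute("""
    SELECT a.first_name || ' ' || a.last_name AS full_name,
           COUNT(DISTINCT m.medal_id) AS top_no_medals
    FROM Athletes a
    JOIN Medals m ON m.athlete_id = a.athlete_id
    GROUP BY full_name
    ORDER BY COUNT(DISTINCT m.medal_id) DESC
    LIMIT 1
""").fetchone()

print(distinct_count)  # ('Tom Burke', 2)
```

If the real table has no such column, SQLite's implicit rowid (present on ordinary tables not declared WITHOUT ROWID) may be able to serve the same purpose, though that is an assumption about how the table was created.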

Answer 1: (score: 0)

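Given that the second query (the one that gets the medal count) works, I suggest computing that first and then joining Athletes to get the name that matches that id: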

SELECT a.first_name || ' ' || a.last_name AS full_name, top_no_medals
FROM (SELECT athlete_id, count(*) AS top_no_medals
      FROM Medals
      GROUP BY athlete_id
      ORDER BY COUNT(*) DESC
      LIMIT 1) m
JOIN Athletes a ON m.athlete_id = a.athlete_id
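One thing worth checking with this approach: if Athletes contains duplicate rows for the winning athlete_id, the count is correct but the outer join may return the same row more than once. The sketch below illustrates this on a subset of the question's sample data and shows SELECT DISTINCT as one possible way to collapse the repeats. The column types are assumptions, and Python's sqlite3 module is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Athletes (athlete_id TEXT, first_name TEXT, last_name TEXT, country_id TEXT);
    CREATE TABLE Medals (year INTEGER, sport TEXT, event TEXT, athlete_id TEXT, medal TEXT);
    INSERT INTO Athletes VALUES
        ('BURKETOM01', 'Tom', 'Burke', 'USA'),
        ('HOFMAFRI01', 'Fritz', 'Hofmann', 'DEU'),
        ('BURKETOM01', 'Tom', 'Burke', 'USA');  -- duplicate row
    INSERT INTO Medals VALUES
        (1896, 'Track & Field', '100m Men', 'BURKETOM01', 'GOLD'),
        (1896, 'Track & Field', '100m Men', 'HOFMAFRI01', 'SILVER'),
        (1896, 'Track & Field', '400m Men', 'BURKETOM01', 'GOLD');
""")

# Count medals first, then look up the name; {distinct} toggles DISTINCT.
query = """
    SELECT {distinct} a.first_name || ' ' || a.last_name AS full_name, top_no_medals
    FROM (SELECT athlete_id, COUNT(*) AS top_no_medals
          FROM Medals
          GROUP BY athlete_id
          ORDER BY COUNT(*) DESC
          LIMIT 1) m
    JOIN Athletes a ON m.athlete_id = a.athlete_id
"""

as_written = conn.execute(query.format(distinct="")).fetchall()
with_distinct = conn.execute(query.format(distinct="DISTINCT")).fetchall()

print(as_written)     # [('Tom Burke', 2), ('Tom Burke', 2)]
print(with_distinct)  # [('Tom Burke', 2)]
```

The medal count is right in both cases because it is computed before the join; only the number of returned rows differs.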