Joining data from 4 tables to calculate several weighted scores

Date: 2012-07-31 16:41:45

Tags: mysql

Please consider the following tables.

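The users table holds thousands of Twitter users; their tweets are indexed by sp100_id, the ID of the company a tweet is about (see sp100). tweets.class holds the sentiment class assigned to each tweet (1 = neutral, 2 = positive, 3 = negative). tweets.rt holds the number of times the tweet was retweeted. Finally, every user has a quality score and a follow score, as shown below: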

users                       tweets
-------------------------   -----------------------------------------------
user_id quality follow      tweet_id sp100_id nyse_date   user_id class  rt
-------------------------   -----------------------------------------------
1       2.50    5.00        1        1        2011-03-12  1       1      0
2       0.75    1.00        2        1        2011-03-13  1       2      2
                            3        1        2011-03-13  1       2      1
daterange                   4        1        2011-03-13  2       2      0
----------------            5        1        2011-03-13  2       3      3
_date                       6        2        2011-03-12  2       2      3
----------------            7        2        2011-03-12  2       2      0
2011-03-11                  8        2        2011-03-12  1       3      5
2011-03-12                  9        2        2011-03-13  2       2      0
2011-03-13

sp100
----------------
sp100_id  _name
----------------
1         Alcoa
2         Apple

The desired output is a list, per sp100_id and _date, of the number of positive (class=2) and negative (class=3) tweets, each weighted by rt, quality and follow:

sp100_id  nyse_date  pos-rt pos-quality pos-follow neg-rt neg-quality neg-follow
--------------------------------------------------------------------------------
1         2011-03-11 0      0           0          0      0           0
1         2011-03-12 0      0           0          0      0           0
1         2011-03-13 5 (1)  5.75 (2)    11.00 (3)  3 (4)  0.75 (5)    1.00 (6)
2         2011-03-11 0      0           0          0      0           0
2         2011-03-12 3 (7)  5.00 (8)    10.00 (9)  5.00   2.50        2.50
2         2011-03-13 0      0.75        1.00       0      0           0
--------------------------------------------------------------------------------

(1) On 2011-03-13, 3 positive tweets for sp100_id 1. 1 tweet retweeted 2 times,
    1 tweet retweeted 1 time and 1 tweet retweeted 0 times = 2x2+1x1+1x0 = 5
(2) On 2011-03-13, 2 positive tweets made by user 1, who has quality 2.50 and
    1 positive tweet made by user 2, who has quality 0.75 = 2x2.50+1x0.75 = 5.75
(3) On 2011-03-13, 2 positive tweets made by user 1, who has follow 5.00 and
    1 positive tweet made by user 2, who has follow 1 = 2x5.00+1x1.00 = 11.00
(4) On 2011-03-13, 1 negative tweet made by user 2, retweeted 3 times = 1x3 = 3
(5) On 2011-03-13, 1 negative tweet made by user 2, who has quality 0.75, thus
    1x0.75 = 0.75
(6) On 2011-03-13, 1 negative tweet made by user 2, who has follow 1.00 so
    1x1.00 = 1.00
(7) 1 positive tweet which has been retweeted 3 times, 1 positive tweet without
    any retweets = 1x3+1x0 = 3
(8) 2 positive tweets from user 2 x quality 2.50 = 5.00
(9) 2 positive tweets x follow 5 = 10.00

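I have tried to explain myself as well as I can. Can anyone help me build the right query? As you can see, dates without any tweets (all values zero) also need to be included in the result set. This is what I have so far, but I am having trouble completing the rest: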

SELECT
    s.sp100_id,
    d._date,
    COALESCE(c.pos-rt,0)      AS pos-rt,
    COALESCE(c.pos-quality,0) AS pos-quality,
    COALESCE(c.pos-follow,0)  AS pos-follow,
    COALESCE(c.neg-rt,0)      AS neg-rt,
    COALESCE(c.neg-quality,0) AS neg-quality,
    COALESCE(c.neg-follow,0)  AS neg-follow
FROM sp100 s
CROSS JOIN daterange d
LEFT JOIN (
    SELECT 
        sp100_id,
        nyse_date, 
        COUNT(CASE class WHEN 2 THEN 1 END) * [rt]      AS pos-rt,
        COUNT(CASE class WHEN 2 THEN 1 END) * [quality] AS pos-quality,
        COUNT(CASE class WHEN 2 THEN 1 END) * [follow]  AS pos-follow,
        COUNT(CASE class WHEN 3 THEN 1 END) * [rt]      AS neg-rt,
        COUNT(CASE class WHEN 3 THEN 1 END) * [quality] AS neg-quality,
        COUNT(CASE class WHEN 3 THEN 1 END) * [follow]  AS neg-follow
    FROM tweets 
    GROUP BY sp100_id, nyse_date
) c ON s.sp100_id = c.sp100_id AND d._date = c.nyse_date
ORDER BY s.sp100_id, d._date ASC

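Obviously, [rt], [quality] and [follow] need to be replaced with the correct syntax. I am also unsure about COUNT(...): as written, it first counts the number of tweets, whereas each tweet should be taken separately and multiplied by its own retweet count (rt).

Can anyone help me out?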

1 Answer:

Answer 0: (score: 2)

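Assuming I have understood the problem correctly (see the comments above), you merely need to group the joined tables and SUM() the relevant field wherever the tweet belongs to the desired class, which can be tested with IF():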

SELECT      sp100.sp100_id                            AS `sp100_id`,
            daterange._date                           AS `nyse_date`,
            SUM(IF(tweets.class=2, tweets.rt,     0)) AS `pos-rt`,
            SUM(IF(tweets.class=2, users.quality, 0)) AS `pos-quality`,
            SUM(IF(tweets.class=2, users.follow,  0)) AS `pos-follow`,
            SUM(IF(tweets.class=3, tweets.rt,     0)) AS `neg-rt`,
            SUM(IF(tweets.class=3, users.quality, 0)) AS `neg-quality`,
            SUM(IF(tweets.class=3, users.follow,  0)) AS `neg-follow`       
FROM        sp100
       JOIN daterange
  LEFT JOIN tweets ON tweets.nyse_date = daterange._date
                  AND tweets.sp100_id  = sp100.sp100_id
  LEFT JOIN users  ON tweets.user_id   = users.user_id
GROUP BY    sp100.sp100_id, daterange._date

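To try the query above without access to the real data, the sample rows from the question can be loaded into scratch tables. This is only a sketch: the column types are my assumptions (the question does not give the schema), and the final SELECT is the same aggregation rewritten with standard CASE expressions and an explicit CROSS JOIN rather than the IF() form shown above.

```sql
-- Assumed schema: the column types below are guesses, not taken from the question.
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    quality DECIMAL(5,2) NOT NULL,
    follow  DECIMAL(5,2) NOT NULL
);
CREATE TABLE sp100 (
    sp100_id INT PRIMARY KEY,
    _name    VARCHAR(50) NOT NULL
);
CREATE TABLE daterange (
    _date DATE PRIMARY KEY
);
CREATE TABLE tweets (
    tweet_id  INT PRIMARY KEY,
    sp100_id  INT NOT NULL,
    nyse_date DATE NOT NULL,
    user_id   INT NOT NULL,
    class     TINYINT NOT NULL,  -- 1 = neutral, 2 = positive, 3 = negative
    rt        INT NOT NULL       -- number of retweets
);

-- Sample rows copied from the tables in the question.
INSERT INTO users     VALUES (1, 2.50, 5.00), (2, 0.75, 1.00);
INSERT INTO sp100     VALUES (1, 'Alcoa'), (2, 'Apple');
INSERT INTO daterange VALUES ('2011-03-11'), ('2011-03-12'), ('2011-03-13');
INSERT INTO tweets    VALUES
    (1, 1, '2011-03-12', 1, 1, 0),
    (2, 1, '2011-03-13', 1, 2, 2),
    (3, 1, '2011-03-13', 1, 2, 1),
    (4, 1, '2011-03-13', 2, 2, 0),
    (5, 1, '2011-03-13', 2, 3, 3),
    (6, 2, '2011-03-12', 2, 2, 3),
    (7, 2, '2011-03-12', 2, 2, 0),
    (8, 2, '2011-03-12', 1, 3, 5),
    (9, 2, '2011-03-13', 2, 2, 0);

-- Same aggregation as the IF() query, written with CASE.
-- Company/date pairs without tweets hit the ELSE branch and sum to 0.
SELECT     s.sp100_id,
           d._date AS nyse_date,
           SUM(CASE WHEN t.class = 2 THEN t.rt      ELSE 0 END) AS `pos-rt`,
           SUM(CASE WHEN t.class = 2 THEN u.quality ELSE 0 END) AS `pos-quality`,
           SUM(CASE WHEN t.class = 2 THEN u.follow  ELSE 0 END) AS `pos-follow`,
           SUM(CASE WHEN t.class = 3 THEN t.rt      ELSE 0 END) AS `neg-rt`,
           SUM(CASE WHEN t.class = 3 THEN u.quality ELSE 0 END) AS `neg-quality`,
           SUM(CASE WHEN t.class = 3 THEN u.follow  ELSE 0 END) AS `neg-follow`
FROM       sp100 s
CROSS JOIN daterange d
LEFT JOIN  tweets t ON t.nyse_date = d._date
                   AND t.sp100_id  = s.sp100_id
LEFT JOIN  users  u ON u.user_id   = t.user_id
GROUP BY   s.sp100_id, d._date
ORDER BY   s.sp100_id, d._date;
```

On these rows, sp100_id 1 on 2011-03-13 comes out with pos-quality 5.75 and pos-follow 11.00, agreeing with notes (2) and (3) in the question. pos-rt for the same row is 2 + 1 + 0 = 3, because each tweet contributes its own rt exactly once.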

[Edit] Here is the EXPLAIN output:

id select_type table     type   possible_keys             key        key_len  ref                        rows  extra
-----------------------------------------------------------------------------------------------------------------------------------------------------------
1  SIMPLE      sp100     index  NULL                      PRIMARY    4        NULL                        101  Using index; Using temporary; Using filesort
1  SIMPLE      daterange index  NULL                      _date      3        NULL                        147  Using index; Using join buffer
1  SIMPLE      tweets    ref    query,nyse_date,sp100_id  nyse_date  3        sentimeter.daterange._date 3815    
1  SIMPLE      users     eq_ref PRIMARY                   PRIMARY    4        sentimeter.tweets.user_id     1
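
Reading the plan, the tweets line uses the single-column nyse_date key with a ref on daterange._date only and an estimate of about 3815 rows per lookup, which suggests the sp100_id half of the join condition is applied after the index lookup. One thing that might be worth trying is a composite index over both join columns. This is an untested suggestion rather than a measured fix, and the index name below is hypothetical.

```sql
-- Hypothetical index name; assumes tweets has no composite index on these columns yet.
ALTER TABLE tweets ADD INDEX idx_sp100_nyse_date (sp100_id, nyse_date);

-- Afterwards, prefix the SELECT with EXPLAIN again and compare the
-- key and rows columns of the tweets line with the plan above.
```

Whether the optimizer picks the new index depends on the data distribution, so the before and after plans are worth comparing on the real tables.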