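Please consider the following tables. users holds thousands of Twitter users; their tweets are indexed by sp100_id, which is the ID of the company the tweet is about (see sp100). tweets.class holds the sentiment class assigned to each tweet (1 = neutral, 2 = positive, 3 = negative). tweets.rt holds the number of times the tweet was retweeted. Finally, each user has been given a quality score and a follow score, as shown below: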
users
-------------------------
user_id  quality  follow
-------------------------
1        2.50     5.00
2        0.75     1.00

tweets
-----------------------------------------------------
tweet_id  sp100_id  nyse_date   user_id  class  rt
-----------------------------------------------------
1         1         2011-03-12  1        1      0
2         1         2011-03-13  1        2      2
3         1         2011-03-13  1        2      1
4         1         2011-03-13  2        2      0
5         1         2011-03-13  2        3      3
6         2         2011-03-12  2        2      3
7         2         2011-03-12  2        2      0
8         2         2011-03-12  1        3      5
9         2         2011-03-13  2        2      0

daterange
----------------
_date
----------------
2011-03-11
2011-03-12
2011-03-13

sp100
----------------
sp100_id  _name
----------------
1         Alcoa
2         Apple
The desired output is a list, per sp100_id and per _date, of the number of positive (class=2) and negative (class=3) tweets, each weighted by rt, quality and follow:
sp100_id  nyse_date   pos-rt  pos-quality  pos-follow  neg-rt    neg-quality  neg-follow
-----------------------------------------------------------------------------------------
1         2011-03-11  0       0            0           0         0            0
1         2011-03-12  0       0            0           0         0            0
1         2011-03-13  5 (1)   5.75 (2)     11.00 (3)   3 (4)     0.75 (5)     1.00 (6)
2         2011-03-11  0       0            0           0         0            0
2         2011-03-12  3 (7)   5.00 (8)     10.00 (9)   5.00      2.50         2.50
2         2011-03-13  0       0.75         1.00        0         0            0
-----------------------------------------------------------------------------------------
(1) On 2011-03-13, 3 positive tweets for sp100_id 1. 1 tweet retweeted 2 times,
1 tweet retweeted 1 time and 1 tweet retweeted 0 times = 2x2+1x1+1x0 = 5
(2) On 2011-03-13, 2 positive tweets made by user 1, who has quality 2.50 and
1 positive tweet made by user 2, who has quality 0.75 = 2x2.50+1x0.75 = 5.75
(3) On 2011-03-13, 2 positive tweets made by user 1, who has follow 5.00 and
1 positive tweet made by user 2, who has follow 1 = 2x5.00+1x1.00 = 11.00
(4) On 2011-03-13, 1 negative tweet made by user 2, retweeted 3 times = 1x3 = 3
(5) On 2011-03-13, 1 negative tweet made by user 2, who has quality 0.75, thus
1x0.75 = 0.75
(6) On 2011-03-13, 1 negative tweet made by user 2, who has follow 1.00 so
1x1.00 = 1.00
(7) 1 positive tweet which has been retweeted 3 times, 1 positive tweet without
any retweets = 1x3+1x0 = 3
(8) 2 positive tweets from user 2 x quality 2.50 = 5.00
(9) 2 positive tweets x follow 5 = 10.00
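I have tried to explain myself as well as I can. Can anyone help me construct the right query? As you can see, dates without any tweets (all values zero) also need to be included in the result set. This is what I have so far, but I am having trouble completing the rest: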
SELECT
    s.sp100_id,
    d._date,
    COALESCE(c.`pos-rt`,0) AS `pos-rt`,
    COALESCE(c.`pos-quality`,0) AS `pos-quality`,
    COALESCE(c.`pos-follow`,0) AS `pos-follow`,
    COALESCE(c.`neg-rt`,0) AS `neg-rt`,
    COALESCE(c.`neg-quality`,0) AS `neg-quality`,
    COALESCE(c.`neg-follow`,0) AS `neg-follow`
FROM sp100 s
CROSS JOIN daterange d
LEFT JOIN (
    SELECT
        sp100_id,
        nyse_date,
        COUNT(CASE class WHEN 2 THEN 1 END) * [rt] AS `pos-rt`,
        COUNT(CASE class WHEN 2 THEN 1 END) * [quality] AS `pos-quality`,
        COUNT(CASE class WHEN 2 THEN 1 END) * [follow] AS `pos-follow`,
        COUNT(CASE class WHEN 3 THEN 1 END) * [rt] AS `neg-rt`,
        COUNT(CASE class WHEN 3 THEN 1 END) * [quality] AS `neg-quality`,
        COUNT(CASE class WHEN 3 THEN 1 END) * [follow] AS `neg-follow`
    FROM tweets
    GROUP BY sp100_id, nyse_date
) c ON s.sp100_id = c.sp100_id AND d._date = c.nyse_date
ORDER BY s.sp100_id, d._date ASC
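Obviously, [rt], [quality] and [follow] need to be replaced with the correct syntax. I am also unsure about the COUNT(...): as written it first counts the number of tweets, whereas it should take each tweet separately and multiply it by its own retweet count (rt).
Can anybody help me out?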
Answer 0 (score: 2)
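Assuming I have understood the problem correctly (see the comments above), you just need to group the joined tables and SUM() the relevant field wherever the tweet belongs to the desired class, which can be tested with IF():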
SELECT sp100.sp100_id AS `sp100_id`,
       daterange._date AS `nyse_date`,
       SUM(IF(tweets.class=2, tweets.rt, 0)) AS `pos-rt`,
       SUM(IF(tweets.class=2, users.quality, 0)) AS `pos-quality`,
       SUM(IF(tweets.class=2, users.follow, 0)) AS `pos-follow`,
       SUM(IF(tweets.class=3, tweets.rt, 0)) AS `neg-rt`,
       SUM(IF(tweets.class=3, users.quality, 0)) AS `neg-quality`,
       SUM(IF(tweets.class=3, users.follow, 0)) AS `neg-follow`
FROM sp100
JOIN daterange
LEFT JOIN tweets ON tweets.nyse_date = daterange._date
                AND tweets.sp100_id = sp100.sp100_id
LEFT JOIN users ON tweets.user_id = users.user_id
GROUP BY sp100.sp100_id, daterange._date
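To check the grouping idea against the sample rows from the question, here is a minimal sketch that loads them into an in-memory SQLite database and runs an equivalent query. Several things here are my own assumptions rather than part of the original setup: SQLite stands in for MySQL, IF() is swapped for the portable CASE WHEN form, the aliases use underscores instead of hyphens, and the function name weighted_sentiment and the column types are made up for illustration.

```python
import sqlite3

# Sample rows copied from the tables in the question.
USERS = [(1, 2.50, 5.00), (2, 0.75, 1.00)]  # (user_id, quality, follow)
TWEETS = [  # (tweet_id, sp100_id, nyse_date, user_id, class, rt)
    (1, 1, "2011-03-12", 1, 1, 0),
    (2, 1, "2011-03-13", 1, 2, 2),
    (3, 1, "2011-03-13", 1, 2, 1),
    (4, 1, "2011-03-13", 2, 2, 0),
    (5, 1, "2011-03-13", 2, 3, 3),
    (6, 2, "2011-03-12", 2, 2, 3),
    (7, 2, "2011-03-12", 2, 2, 0),
    (8, 2, "2011-03-12", 1, 3, 5),
    (9, 2, "2011-03-13", 2, 2, 0),
]
DATES = [("2011-03-11",), ("2011-03-12",), ("2011-03-13",)]
COMPANIES = [(1, "Alcoa"), (2, "Apple")]

# Column types are assumed; the question does not show the real schema.
SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, quality REAL, follow REAL);
CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, sp100_id INTEGER,
                     nyse_date TEXT, user_id INTEGER, class INTEGER, rt INTEGER);
CREATE TABLE daterange (_date TEXT PRIMARY KEY);
CREATE TABLE sp100 (sp100_id INTEGER PRIMARY KEY, _name TEXT);
"""

# Same shape as the MySQL query above, with CASE WHEN in place of IF().
QUERY = """
SELECT s.sp100_id, d._date,
       SUM(CASE WHEN t.class = 2 THEN t.rt      ELSE 0 END) AS pos_rt,
       SUM(CASE WHEN t.class = 2 THEN u.quality ELSE 0 END) AS pos_quality,
       SUM(CASE WHEN t.class = 2 THEN u.follow  ELSE 0 END) AS pos_follow,
       SUM(CASE WHEN t.class = 3 THEN t.rt      ELSE 0 END) AS neg_rt,
       SUM(CASE WHEN t.class = 3 THEN u.quality ELSE 0 END) AS neg_quality,
       SUM(CASE WHEN t.class = 3 THEN u.follow  ELSE 0 END) AS neg_follow
FROM sp100 s
CROSS JOIN daterange d
LEFT JOIN tweets t ON t.nyse_date = d._date AND t.sp100_id = s.sp100_id
LEFT JOIN users u ON u.user_id = t.user_id
GROUP BY s.sp100_id, d._date
ORDER BY s.sp100_id, d._date
"""


def weighted_sentiment():
    """Load the sample data and return one row per (sp100_id, _date)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", USERS)
    conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?, ?, ?)", TWEETS)
    conn.executemany("INSERT INTO daterange VALUES (?)", DATES)
    conn.executemany("INSERT INTO sp100 VALUES (?, ?)", COMPANIES)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in weighted_sentiment():
        print(row)
```

Every company/date pair appears in the result, including the dates without tweets, because the LEFT JOIN leaves NULLs that fall through to the 0 branch. On the sample rows, sp100_id 1 on 2011-03-13 gets pos_quality 5.75 and pos_follow 11.0, in line with notes (2) and (3) in the question; pos_rt for that row is the plain sum of the rt column, 2 + 1 + 0 = 3.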
[Edit] Here is the EXPLAIN:
id  select_type  table      type    possible_keys             key        key_len  ref                         rows  extra
-----------------------------------------------------------------------------------------------------------------------------------------------------------
1   SIMPLE       sp100      index   NULL                      PRIMARY    4        NULL                        101   Using index; Using temporary; Using filesort
1   SIMPLE       daterange  index   NULL                      _date      3        NULL                        147   Using index; Using join buffer
1   SIMPLE       tweets     ref     query,nyse_date,sp100_id  nyse_date  3        sentimeter.daterange._date  3815
1   SIMPLE       users      eq_ref  PRIMARY                   PRIMARY    4        sentimeter.tweets.user_id   1