I am trying to find the ten most frequent mentions (@xxxxx) in my Twitter data. I created the initial table twitter.full_text_ts and loaded my data into it:
create table twitter.full_text_ts as
select id, cast(concat(substr(ts,1,10), ' ', substr(ts,12,8)) as timestamp) as ts, lat, lon, tweet
from full_text;
I have been able to extract the mentions from the tweets with this query (the mention is returned in the patterns column):
select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts
order by patterns desc
limit 50;
Running it gives me:
USER_a3ed4b5a 2010-03-07 03:46:23 fffed220
USER_dc8cfa6f 2010-03-05 18:28:39 fffdabf9
USER_dc8cfa6f 2010-03-05 18:32:55 fffdabf9
USER_915e3f8c 2010-03-07 03:39:09 fffdabf9
and so on...
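As you can see, fffed220 and so on are the extracted patterns.
What I want to do now is count how many times each mention (pattern) occurs and output the top 10. For example, fffdabf9 occurs 20 times, fffxxxx occurs 17 times, and so on.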
Answer 0 (score: 0)
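with mentions as
(select id, ts,
regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts)
-- no "order by ... limit 50" inside the CTE: it would cap the input at 50 rows and skew the counts
select patterns, count(*) as n_mentions
from mentions
-- skip tweets without a mention (assuming they yield an empty string or NULL)
where patterns <> ''
group by patterns
order by n_mentions desc
limit 10;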
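To sanity-check the extraction and the counting before running them on the full table, the same logic can be tried on a few inlined rows. This is only a sketch: the sample tweets are made up, and it assumes a Hive version that accepts select without a from clause (0.13 or later, as far as I know). Note that the pattern only matches a mention followed by ':', '|' or a space, so a mention at the very end of a tweet is not counted.

```sql
-- Hypothetical sample tweets, inlined so the query needs no existing table.
with sample as (
  select 'hi @USER_fffdabf9: nice' as tweet
  union all select '@USER_fffdabf9 thanks'
  union all select 'rt @USER_fffed220: lol'
  union all select 'no mention here'
),
mentions as (
  -- Same extraction as in the question: group 2 is the 8-character user id.
  select regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)', 2) as patterns
  from sample
)
select patterns, count(*) as n_mentions
from mentions
-- Drop rows where nothing was extracted.
where patterns <> ''
group by patterns
order by n_mentions desc
limit 10;
```

Once the counts look right on the sample, swap the sample CTE for twitter.full_text_ts.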
Answer 1 (score: 0)
The most readable approach is to save the first query into a temporary table and then run the group by on that table:
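-- step 1: materialize your extraction query (without "order by ... limit 50")
create table tmp as
select id, ts, regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)',2) as patterns
from twitter.full_text_ts;
-- step 2: count each mention and keep the 10 most frequent
select patterns, count(*) as n_mentions
from tmp
group by patterns
order by n_mentions desc
limit 10;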
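A plain create table leaves the table behind after the session ends. If it is only needed for this one analysis, a session-scoped table may be a better fit. A minimal sketch, assuming Hive 0.14 or later (where create temporary table should be available); the name tmp_mentions is made up for illustration:

```sql
-- Assumption: Hive 0.14+; tmp_mentions is a hypothetical name.
-- A temporary table is visible only to the current session and is dropped when it ends.
create temporary table tmp_mentions as
select regexp_extract(lower(tweet), '(.*)@user_(\\S{8})([:| ])(.*)', 2) as patterns
from twitter.full_text_ts;

-- Count each extracted mention, ignoring rows where nothing was extracted.
select patterns, count(*) as n_mentions
from tmp_mentions
where patterns <> ''
group by patterns
order by n_mentions desc
limit 10;
```

On older Hive versions, a regular table followed by drop table if exists once you are done achieves the same cleanup.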