Analyzing tweets with Hive and regex extraction

Date: 2014-07-10 07:47:00

Tags: regex twitter hive

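I am trying to find the most popular hashtags for July. So far I can either select the tweets from July or rank the most popular hashtags, but I have not managed to combine the two. I am thinking of creating an intermediate table that holds the July tweets and then ranking the hashtags from it, but I don't know how. Can you help me? What about a two-level select (a SELECT from the result of another SELECT)?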

SELECT hashtags.text, count(*) as total FROM tweets
WHERE regexp_extract(created_at, "(Tue) (Jul)*", 2) = "Jul"
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text), created_at
ORDER BY total_count DESC
LIMIT 200

Regards, K.

1 answer:

Answer 0 (score: 0)

So far I have come up with this, which does exactly what I want, but is there a better way to achieve it?

Use a nested query:

SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
  SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
  ) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
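
The subquery is not strictly required. In HiveQL, LATERAL VIEW is part of the FROM clause, so it has to come before WHERE; placing it after WHERE is a syntax error. A minimal single-level sketch, assuming the same tweets table and keeping the date filter unchanged (the hashtag alias is only illustrative):

```sql
-- Single-level variant: LATERAL VIEW directly follows the table in FROM,
-- and the WHERE filter comes after it.
SELECT
  LOWER(hashtags.text) AS hashtag,   -- illustrative alias, not in the original
  COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE regexp_extract(created_at, "(Tue Jul)*", 1) = "Tue Jul"
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;
```

This should return the same rows as the nested form. Whether it is any faster depends on the Hive version and its optimizer, so compare the EXPLAIN output of both before assuming a difference.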

Edit:

OK, so if you prefer, you can also do this with a temporary table:

CREATE TABLE tmpdb (
   id BIGINT,
   created_at STRING,
   source STRING,
   favorited BOOLEAN,
   retweet_count INT,
   retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>>,
   entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
   text STRING,
   user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
   in_reply_to_screen_name STRING
) 
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
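
If tweets already has exactly this schema, the column list does not have to be repeated. As an alternative to the statement above, here is a sketch that assumes tweets is an unpartitioned table defined with the same JSON SerDe. CREATE TABLE ... LIKE copies the table definition (columns and SerDe) but none of the data:

```sql
-- Creates an empty table with the same definition as tweets.
-- Assumption: tweets is not partitioned; otherwise a later
-- INSERT OVERWRITE into tmpdb would need a PARTITION clause.
CREATE TABLE tmpdb LIKE tweets;
```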

Then you populate it:

INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
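
One caveat about the filter: Twitter's created_at strings start with the weekday (for example `Tue Jul 08 07:47:00 +0000 2014`, an illustrative value), so the pattern "(Tue Jul)*" only keeps tweets posted on a Tuesday in July. If all of July is intended, the following sketch captures the month token regardless of the weekday. It assumes the weekday, month, day layout shown in the example value:

```sql
-- Assumes created_at looks like 'Tue Jul 08 07:47:00 +0000 2014'.
-- Group 1 captures the second space-separated token, i.e. the month.
INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets
WHERE regexp_extract(created_at, '^[A-Za-z]+ ([A-Za-z]+) ', 1) = 'Jul';
```

This still mixes Julys from different years. If the data spans more than one year, an additional condition on the trailing year would be needed.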

The query then becomes this simple:

SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM tmpdb
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15

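On the pros and cons of the second method: if you need accurate results, you have to refresh the table before querying, so it is a poor fit for a one-off query. If you need to run several queries against the same snapshot of the data, however, this method is better. Don't forget that copying the data is an expensive operation! So know when to use it :)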
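
If the cost of copying is the main concern, a view is a possible middle ground. It stores only the query, so nothing is copied and the results always reflect the current contents of tweets, at the price of re-running the filter on every query. A sketch, where the view name july_tweets and the hashtag alias are hypothetical:

```sql
-- A view stores the query definition only; no data is copied.
CREATE VIEW july_tweets AS
SELECT * FROM tweets
WHERE regexp_extract(created_at, "(Tue Jul)*", 1) = "Tue Jul";

-- The hashtag ranking then reads from the view instead of tmpdb.
SELECT
  LOWER(hashtags.text) AS hashtag,
  COUNT(*) AS total_count
FROM july_tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;
```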