Searching Twitter for hashtags that contain a word rather than an exact match

Time: 2016-04-24 01:19:58

Tags: twitter match twitter-streaming-api

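I am using Tweepy to scrape tweets. I have been reading through the Streaming API documentation, and under "track" it lists which tweets a given search term can return.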

https://dev.twitter.com/streaming/overview/request-parameters

For the most part, it seems the API only returns exact matches (plus a few extra cases where punctuation directly follows or precedes the term). I want to search for tweets with hashtags and find a certain word inside the hashtag, such as pillow. So for this example, I would want tweets with:

# pillow #pillow

But if I track #mybedpillow with the API, I only get exact matches for #mypillowbed.

If I track #pillow, I get tweets with #pillow, but none where the hashtag has any extra text attached to it.

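The only approach I can see right now is to stream random tweets and then filter them by hashtags that match my case. That would take much longer to collect the data I need. Any ideas?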

1 answer:

Answer 0: (score: -1)

This thread may help with your work. A regular expression can solve your problem: Best HashTag Regex
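
For what it's worth, here is a rough sketch of what the client-side regex filtering could look like once the tweet text has been received. The pattern `#(\w+)` is a simplified assumption, not Twitter's actual hashtag tokenizer, and the names `HASHTAG_RE` and `hashtags_containing` are made up for this example.

```python
import re

# Simplified hashtag pattern: '#' followed by word characters.
# Assumption for illustration only; Twitter's own rules are stricter.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtags_containing(text, word):
    """Return hashtags (without '#') in `text` whose body contains `word`.

    The comparison is case-insensitive, so 'pillow' matches '#PillowFight'.
    """
    needle = word.lower()
    return [tag for tag in HASHTAG_RE.findall(text) if needle in tag.lower()]


if __name__ == "__main__":
    sample = "New sheets today! #mybedpillow #sleep"
    print(hashtags_containing(sample, "pillow"))
```

You would call something like this on the text of each incoming tweet in whatever callback your stream listener uses, and keep the tweet only when the returned list is non-empty. Note that this only filters tweets you have already received; it does not change what the `track` parameter matches on the server side.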