I am currently working on a project that classifies tweets according to the categories they belong to. For example, a tweet containing the phrase "I think smoking should be banned in New York" would be classified under the "pollution" category, with a negative sentiment.
I have the sentiment analysis working, but I need some help creating a database of categories and linking it to Python. I am also open to other solutions.
My code so far is below. 1) stream.py. I convert live Twitter data to a text file with: python stream.py > output.txt
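To make the question concrete, here is a minimal sketch of the kind of category lookup I have in mind: a keyword-to-category mapping kept in Python. The category names and keywords below are just placeholders, not a real taxonomy:

```python
# Hypothetical keyword-to-category mapping; the categories and keywords
# here are placeholders for whatever the real database would hold.
CATEGORY_KEYWORDS = {
    "pollution": ["smoking", "smog", "exhaust", "litter"],
    "traffic": ["subway", "commute", "gridlock"],
}

def categorize(tweet_text):
    """Return the set of categories whose keywords appear in the tweet."""
    words = set(tweet_text.lower().split())
    return {cat for cat, kws in CATEGORY_KEYWORDS.items()
            if words & set(kws)}

print(categorize("I think smoking should be banned in New York"))
# {'pollution'}
```

A real solution could move `CATEGORY_KEYWORDS` into SQLite or a JSON file, but the lookup logic would stay the same.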
import oauth2 as oauth
import urllib2 as urllib

api_key = 'xx'
api_secret = 'xx'
access_token_key = 'x-x'
access_token_secret = 'x'

_debug = 0

oauth_token = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"

http_handler = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

'''
Construct, sign, and open a twitter request
using the hard-coded credentials above.
'''
def twitterreq(url, method, parameters):
    # Note: the request is signed and sent with the global http_method
    # ("GET"); the `method` argument is never actually used below.
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                                token=oauth_token,
                                                http_method=http_method,
                                                http_url=url,
                                                parameters=parameters)
    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)
    headers = req.to_header()

    if http_method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
        url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)

    response = opener.open(url, encoded_post_data)
    return response

#locations=-74,40,-73,41
def fetchsamples():
    url = "https://stream.twitter.com/1.1/statuses/filter.json?track=money&locations=-74,40,-73,41"
    parameters = []
    response = twitterreq(url, "POST", parameters)
    for line in response:
        print(line.strip())

if __name__ == '__main__':
    fetchsamples()
The sentiment of a tweet is computed as the sum of the sentiment scores of each word in the tweet. Run: python tweet_sentiment.py AFINN-111.txt tweet_file to get the tweet sentiment.
Here is the link where I uploaded AFINN-111.txt: http://s000.tinyupload.com/index.php?file_id=62473255612293859764
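To illustrate the "sum of word scores" idea, here is a tiny worked example with a few scores hard-coded instead of parsed from AFINN-111.txt (the values are illustrative, not necessarily the exact AFINN entries):

```python
# Illustrative AFINN-style scores, hard-coded so the example is
# self-contained; the real script reads these from AFINN-111.txt.
scores = {"banned": -2.0, "love": 3.0, "disaster": -2.0}

def tweet_sentiment(tweet, scores):
    # Sum the score of each word; words missing from the lexicon count as 0.
    return sum(scores.get(w, 0.0) for w in tweet.lower().split())

print(tweet_sentiment("I think smoking should be banned in New York", scores))
# -2.0
```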
Here is the code for tweet_sentiment.py:
import sys
import json
import ast
import re
def calcScoreFromTerm(termScoreFile):  # returns a dictionary with term-score values
    scores = {}
    for line in termScoreFile:
        term, score = line.split("\t")
        scores[term] = float(score)
    return scores

def getTweetText(tweet_file):  # returns a list of all tweets
    tweets = []
    for line in tweet_file:
        jsondata = json.loads(line)
        if "text" in jsondata.keys():
            tweets.append(jsondata["text"])
    tweet_file.close()
    return tweets

def filterTweet(et):
    # Remove punctuation and non-alphanumeric chars from each tweet string
    pattern = re.compile('[^A-Za-z0-9]+')
    et = pattern.sub(' ', et)
    words = et.split()
    # Filter unnecessary words. Build a new list rather than calling
    # words.remove() while iterating, which skips elements.
    words = [w for w in words
             if not (w.startswith("RT") or w.startswith("www") or w.startswith("http"))]
    return words
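For reference, the whole scoring pass can be sketched as one self-contained function (same logic as calcScoreFromTerm, getTweetText, and a simplified filterTweet combined; the file names are just whatever you pass in):

```python
import json
import re

def score_file(afinn_path, tweet_path):
    """Print one sentiment score per tweet line in tweet_path."""
    # Build the term -> score dictionary from a tab-separated lexicon file.
    with open(afinn_path) as f:
        scores = {term: float(score)
                  for term, score in (line.split("\t") for line in f)}
    pattern = re.compile('[^A-Za-z0-9]+')
    with open(tweet_path) as f:
        for line in f:
            data = json.loads(line)
            # Skip stream records with no "text" field (e.g. deletions).
            if "text" not in data:
                continue
            words = pattern.sub(' ', data["text"]).split()
            # Words missing from the lexicon count as 0.
            print(sum(scores.get(w.lower(), 0.0) for w in words))

# e.g. score_file("AFINN-111.txt", "output.txt")
```

This sketch drops the RT/www/http token filtering for brevity; those tokens score 0 anyway unless they happen to collide with a lexicon term.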