Grouping tweets by half-hour, hour, and day in a Pandas DataFrame

Time: 2016-10-17 15:26:28

Tags: python pandas dataframe tweepy sentiment-analysis

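I'm working on a sentiment analysis project using Twitter data, and I've run into a small problem with dates. The code itself runs fine, but I don't know how to build custom time blocks to group my final data. Right now the default is to group the tweets by the second, which isn't very useful. I'd like to be able to group them by half-hour, by hour, and by day.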

Feel free to skip to the bottom of the code to see where the problem is!

Here is the code:

import tweepy
API_KEY = "XXXXX"
API_SECRET = "XXXXXX"
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
import sklearn as sk
import pandas as pd
import got3
  #"Get Old Tweets" to find older data

tweetCriteria = got3.manager.TweetCriteria() 
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-05-07") 
tweetCriteria.setUntil("2016-05-08")
tweetCriteria.setMaxTweets(1000)

TweetCriteria = got3.manager.TweetCriteria()
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)

from afinn import Afinn
afinn = Afinn()
    #getting afinn library to use for sentiment polarity analysis

for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    print(Text)

AllText = []
AllRetweets = []
AllFavorites = []
AllDates = []
AllIDs = []
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    AllText.append(Text)
    AllRetweets.append(Retweets)
    AllFavorites.append(Favorites)
    AllDates.append(Date)
    AllIDs.append(x.id)

data_set = [[x.id, x.date, x.text, x.retweets, x.favorites] 
        for x in KYDerby_tweets]
df = pd.DataFrame(data=data_set, columns=["Id", "Date", "Text", "Retweets", "Favorites"])
    #I now have a DataFrame with my basic info in it

pscore = []
for x in KYDerby_tweets:
    pscore.append(afinn.score(x.text))
df['P Score'] = pscore
    #I now have the pscores for each Tweet in the DataFrame

nrc = pd.read_csv('C:\\users\\andrew.smith\\downloads\\NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt', sep="\t", names=["word", "emotion", "association"], skiprows=45)
    #import NRC emotion lexicon

nrc = nrc[nrc["association"]==1]
nrc = nrc[nrc["emotion"].isin(["positive", "negative"]) == False]
    #cleaned it up a bit

from nltk import TweetTokenizer
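tt = TweetTokenizer()
tokenized = tt.tokenize(df.iloc[14,:]['Text'])
    #tokenize one sample tweet (row 14, the same one used further down)
tokenized = [x.lower() for x in tokenized]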
    #built my Tweet-specific, NRC-ready tokenizer

emotions = list(set(nrc["emotion"]))
index2emotion = {}
emotion2index = {}

for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i  
cv = [0] * len(emotions)
    #built indices showing locations of emotions

for token in tokenized:
    sub = nrc[nrc['word'] == token]
    token_emotions = sub['emotion']
    for e in token_emotions:
        position_index = emotion2index[e]
        cv[position_index]+=1

emotions = list(set(nrc['emotion']))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i

def makeEmoVector(tweettext):
    cv = [0] * len(emotions)
    tokenized = tt.tokenize(tweettext)
    tokenized = [x.lower() for x in tokenized]
    for token in tokenized:
        sub = nrc[nrc['word'] == token]
        token_emotions = sub['emotion']
        for e in token_emotions:
            position_index = emotion2index[e]
            cv[position_index] += 1
    return cv

tweettext = df.iloc[14,:]['Text']

emotion_vectors = []

for text in df['Text']:
    emotion_vector = makeEmoVector(text)
    emotion_vectors.append(emotion_vector)

ev = pd.DataFrame(emotion_vectors, index=df.index, columns=emotions)
    #Now I have a DataFrame with all of the emotion counts for each tweet

df = df.join(ev)
    #attach the emotion counts to the tweets so they can be grouped by Date
Date_Group = df.groupby("Date")
Date_Group[emotions].agg("sum")
    #Finally, we arrive at the problem!  When I run this, I end up with tweets that are grouped by the second.  What I want is to be able to group them: a) by the half-hour, b) by the hour, and c) by the day

1 answer:

Answer 0 (score: 0)

The default date format for tweets from the Tweepy API is "2017-04-14 18:41:56". To group tweets by hour, you can do something simple like this:

# This will get the time parameter
time = [item.split(" ")[1] for item in df['date'].values] 

# This will get the hour parameter
hour = [item.split(":")[0] for item in time]

df['time'] = hour
grouped_tweets = df[['time', 'number_tweets']].groupby('time')
tweet_growth_hour = grouped_tweets.sum()
tweet_growth_hour['time']= tweet_growth_hour.index
print(tweet_growth_hour)
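
Note that splitting the string keeps only the hour of day, so tweets posted at 18:xx on different days end up in the same group. If that is not what you want, one possible alternative is to parse the column into real timestamps and truncate them to the hour. This is only a minimal sketch: the sample rows are made up, and it assumes the same `date` and `number_tweets` column names as the snippet above.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the snippet above.
df = pd.DataFrame({
    "date": ["2017-04-14 18:41:56", "2017-04-14 18:05:10",
             "2017-04-14 19:20:00", "2017-04-15 18:30:00"],
    "number_tweets": [1, 1, 1, 1],
})

# Parse the strings into real timestamps.
df["date"] = pd.to_datetime(df["date"])

# Truncate each timestamp to the start of its hour, keeping the date part.
df["hour_bucket"] = df["date"].dt.floor("60min")

# Sum the tweet counts within each (date, hour) bucket.
tweets_per_hour = df.groupby("hour_bucket")["number_tweets"].sum()
print(tweets_per_hour)
```

Here the two tweets from 2017-04-14 at 18:xx share a bucket, while the 18:30 tweet from the next day gets its own.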

To group by day, you can do something similar:

days = [item.split(" ")[0] for item in df['date'].values]
df['days'] = days
grouped_tweets = df[['days', 'number_tweets']].groupby('days')
tweet_growth_days = grouped_tweets.sum()
tweet_growth_days['days']= tweet_growth_days.index
print(tweet_growth_days)
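
String splitting does not cover the half-hour case directly. If the date column holds real datetimes (or can be converted with `pd.to_datetime`), `pd.Grouper` with a `freq` argument may be a simpler way to get all three groupings. The sketch below is an assumption-laden illustration, not the asker's actual data: the rows are invented, the `joy` and `anger` columns stand in for the emotion columns from the question, and `sum_by` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-05-07 18:05:00", "2016-05-07 18:20:00",
        "2016-05-07 18:45:00", "2016-05-07 19:10:00",
    ]),
    "joy": [1, 0, 2, 1],
    "anger": [0, 1, 0, 0],
})
emotions = ["joy", "anger"]


def sum_by(frame, freq):
    """Sum the emotion columns in fixed-width time bins of width `freq`."""
    return frame.groupby(pd.Grouper(key="Date", freq=freq))[emotions].sum()


half_hourly = sum_by(df, "30min")  # a) by the half-hour
hourly = sum_by(df, "60min")       # b) by the hour
daily = sum_by(df, "D")            # c) by the day
print(half_hourly)
```

Bins between the first and last timestamp that contain no tweets still appear in the result, with sums of 0, so you may want to filter those out afterwards.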