How do I calculate daily word frequency from Twitter data?

Time: 2015-11-01 19:52:08

Tags: python regex twitter pandas word-count

I have a Twitter data frame like this:

>>> twitdata = pd.read_csv('D:\\twit-data.csv')
>>> twitdata

    tweet_id    user_id     user_name   t_date      t_time      tweets
    4.05323E+17 82142636    1nvestor    11/26/2013  8:12:00     Fidelity reports that $TSN stock gets called away. Position now closed.
    2.53585E+17 22042454    Kiplinger   10/3/2012   15:57:00    Did you know that every $100 bump in avg. home prices lifts consumer spending by $5? http://t.co/zXRbWJzR
    ...

I want to calculate the daily frequency of a specific word, such as iphone, and get a result like this:

      date    frequency
2011-01-01    530
2011-01-02    550
...

How can I design a program to achieve this?

2 answers:

Answer 0: (score: 1)

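I created a data frame from random data, but it should give you an idea of how to proceed from here. I set the counting period to calendar days with the 'D' offset; you can change the offset as needed.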

import pandas as pd
import io # only needed to import sample data

data = """
    date          tweet_id    tweet
    2015-10-31    50230       tweet_1
    2015-10-31    48646       tweet_2
    2015-10-31    48748       tweet_3
    2015-10-31    46992       tweet_4
    2015-11-01    46491       tweet_5
    2015-11-01    45347       tweet_6
    2015-11-01    45681       tweet_7
    2015-11-01    46430       tweet_8
    """

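df = pd.read_csv(io.StringIO(data), delimiter=r'\s+',
                 index_col=False, parse_dates=['date'])

# Tweet count starts here
# 'D' is the calendar-day offset; the how= argument was removed from resample,
# so the aggregation is called as a method instead.
df_count = df.set_index('date').resample('D').count()
# Every column holds the same per-day count, so keep only the first one.
df_count = df_count.drop(df_count.columns[1:], axis=1)
df_count.columns = ['count']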

print(df)

Just checking the contents of the original df:

        date  tweet_id    tweet
0 2015-10-31     50230  tweet_1
1 2015-10-31     48646  tweet_2
2 2015-10-31     48748  tweet_3
3 2015-10-31     46992  tweet_4
4 2015-11-01     46491  tweet_5
5 2015-11-01     45347  tweet_6
6 2015-11-01     45681  tweet_7
7 2015-11-01     46430  tweet_8

After applying resample:

print(df_count)

                count
date                 
2015-10-31          4
2015-11-01          4
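
The resample call above counts tweets per day rather than occurrences of a particular word. A minimal sketch of the per-word variant follows, assuming the t_date and tweets columns from the question and an MM/DD/YYYY date format. The sample rows are made up for illustration, and matching is assumed to be case-insensitive on whole words.

```python
import re
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
twitdata = pd.DataFrame({
    "t_date": ["11/26/2013", "11/26/2013", "11/28/2013"],
    "tweets": [
        "New iPhone today. iphone cases too!",
        "Fidelity reports that $TSN stock gets called away.",
        "My iphone battery died",
    ],
})

word = "iphone"  # the word to count (assumption: ignore case, whole words only)
pattern = r"\b" + re.escape(word) + r"\b"

# Occurrences of the word in each tweet; a tweet may contain it more than once.
hits = twitdata["tweets"].str.count(pattern, flags=re.IGNORECASE)

# Index the per-tweet counts by date so they can be resampled.
dates = pd.to_datetime(twitdata["t_date"], format="%m/%d/%Y")
hits.index = pd.DatetimeIndex(dates)

# Sum per calendar day; days without any tweets show a frequency of 0.
daily = hits.resample("D").sum().rename("frequency")
daily.index.name = "date"
print(daily)
```

Using hits.groupby(hits.index).sum() instead of resample would list only the days that actually have tweets.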

Answer 1: (score: -1)

I solved this problem myself; here is my solution.

import operator

# twitdata is the data frame loaded in the question.
# One row per distinct date; only the index (the dates) is used below.
result = twitdata.groupby('t_date').first()
allFreq = {}
for date in range(result.shape[0]):
    # Select the tweets posted on this date.
    df = twitdata.loc[twitdata.t_date == result.index[date], ['t_date', 'tweets']]
    # Join the day's tweets into one string (no temporary file is needed).
    words = ' '.join(df['tweets'])
    # Count each whitespace-separated word, treating commas as separators.
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    # Sort by count, most frequent first, and store the list under this date.
    sorted_x = sorted(wordfreq.items(), key=operator.itemgetter(1), reverse=True)
    allFreq[result.index[date]] = sorted_x
>>> allFreq['2012-06-01']
[('the', 248),
 ('to', 201),
 ('of', 143),
 ('a', 137),
 ('in', 127),
 ('and', 107),
 ('for', 100),
 ('you', 95),
 ('is', 93),
 ('I', 81),
 ...]
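
The same per-day word counts can be sketched more compactly with collections.Counter and a groupby over the date column. This is only an illustration under assumptions: the sample rows are invented, the helper word_counts is a hypothetical name, and words are split the same way as above (commas replaced by spaces, case-sensitive).

```python
from collections import Counter
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
twitdata = pd.DataFrame({
    "t_date": ["11/26/2013", "11/26/2013", "11/27/2013"],
    "tweets": [
        "the iphone is here, the wait is over",
        "I like the iphone",
        "no phones today",
    ],
})

def word_counts(tweets):
    """Count whitespace-separated words across an iterable of tweet strings."""
    counter = Counter()
    for text in tweets:
        counter.update(text.replace(",", " ").split())
    return counter

# One Counter per date, keyed by the date string as it appears in t_date.
all_freq = {
    day: word_counts(group)
    for day, group in twitdata.groupby("t_date")["tweets"]
}

# Most frequent word on one day, and the count of a single word of interest.
print(all_freq["11/26/2013"].most_common(1))
print(all_freq["11/26/2013"]["iphone"])
```

A Counter returns 0 for words it has never seen, so looking up a specific word on a day where it does not occur needs no special handling.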