Count of occurrences

Time: 2017-07-23 17:56:49

Tags: python-2.7 pandas dataframe time-series

I am trying to count occurrences between two timestamped values:

For example:

time    letter
  1     A
  4     B
  5     C
  9     C
  18    B
  30    A
  30    B

I am splitting the time range into windows: 1 + 30/30. Then I want to know how many A, B and C occur in each time window of size 1:

timeseries  A  B  C
1           1  0  0
2           0  0  0
...
30          1  1  0

This should give me a table of 30 rows and 3 columns with the occurrences of A, B and C.

The problem is that slicing the data takes a very long time: even though the data is already sorted, the code below walks through the whole master table once per window to slice out the rows.

master = mytable

minimum = master.timestamp.min()
maximum = master.timestamp.max()

window = (minimum + maximum) / maximum

wstart = minimum
wend = minimum + window

concurrent_tasks = []

while wstart <= maximum:
    As = 0
    Bs = 0
    Cs = 0
    # scans the entire master table for every single window
    for d, row in master.iterrows():
        ttime = row.timestamp
        if wstart <= ttime < wend:
            if row.channel == 'A':
                As = As + 1
            elif row.channel == 'B':
                Bs = Bs + 1
            elif row.channel == 'C':
                Cs = Cs + 1

    # record the counts for this window (wstart identifies the window)
    concurrent_tasks.append([wstart, As, Bs, Cs])

    wstart = wstart + window
    wend = wend + window

Can you help me improve the performance? I would like to use something like a map function and avoid having Python loop over the whole table for every window.

This is part of a big-data job and currently takes days to complete.

Thanks

1 Answer:

Answer 0 (score: 3):

There is a much faster way - pd.get_dummies():
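
The snippets below assume the sample data above is already loaded into a DataFrame named df; a minimal setup, just for reference:

import pandas as pd
import numpy as np

# recreate the question's sample data
df = pd.DataFrame({'time':   [1, 4, 5, 9, 18, 30, 30],
                   'letter': ['A', 'B', 'C', 'C', 'B', 'A', 'B']})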

In [116]: pd.get_dummies(df.set_index('time')['letter'])
Out[116]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  0  0
30    0  1  0

If you want to "compress" (group) the time values:

In [146]: pd.get_dummies(df.set_index('time')['letter']).groupby(level=0).sum()
Out[146]:
      A  B  C
time
1     1  0  0
4     0  1  0
5     0  0  1
9     0  0  1
18    0  1  0
30    1  1  0
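
The same grouped table can also be produced in a single call with pd.crosstab; this is not part of the original answer, just an equivalent alternative:

# counts of each letter per time value, same result as Out[146] above
counts = pd.crosstab(df['time'], df['letter'])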

Or use sklearn.feature_extraction.text.CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(token_pattern=r"\b\w+\b", stop_words=None)

r = pd.SparseDataFrame(cv.fit_transform(df.groupby('time')['letter'].agg(' '.join)),
                       index=df['time'].unique(),
                       columns=df['letter'].unique(),
                       default_fill_value=0)

Result:

In [143]: r
Out[143]:
    A  B  C
1   1  0  0
4   0  1  0
5   0  0  1
9   0  0  1
18  0  1  0
30  1  1  0

If we want to list all times from 1 to 30:

In [153]: r.reindex(np.arange(r.index.min(), r.index.max()+1)).fillna(0).astype(np.int8)
Out[153]:
    A  B  C
1   1  0  0
2   0  0  0
3   0  0  0
4   0  1  0
5   0  0  1
6   0  0  0
7   0  0  0
8   0  0  0
9   0  0  1
10  0  0  0
11  0  0  0
12  0  0  0
13  0  0  0
14  0  0  0
15  0  0  0
16  0  0  0
17  0  0  0
18  0  1  0
19  0  0  0
20  0  0  0
21  0  0  0
22  0  0  0
23  0  0  0
24  0  0  0
25  0  0  0
26  0  0  0
27  0  0  0
28  0  0  0
29  0  0  0
30  1  1  0

Or using Pandas methods:

In [159]: pd.get_dummies(df.set_index('time')['letter']) \
     ...:   .groupby(level=0) \
     ...:   .sum() \
     ...:   .reindex(np.arange(r.index.min(), r.index.max()+1), fill_value=0)
     ...:
Out[159]:
      A  B  C
time
1     1  0  0
2     0  0  0
3     0  0  0
4     0  1  0
5     0  0  1
6     0  0  0
7     0  0  0
8     0  0  0
9     0  0  1
10    0  0  0
...  .. .. ..
21    0  0  0
22    0  0  0
23    0  0  0
24    0  0  0
25    0  0  0
26    0  0  0
27    0  0  0
28    0  0  0
29    0  0  0
30    1  1  0

[30 rows x 3 columns]
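
Mapping this back to the question's setup (a master table with timestamp and channel columns, sliced into fixed-size windows), a sketch along the same lines could bin the timestamps first and then count per bin. The column names and the window size of 1 are taken from the question; everything else is illustrative:

# assumes pandas as pd and numpy as np are imported, and that `master`
# has 'timestamp' and 'channel' columns as in the question's code
window = 1                              # size-1 windows, as in the desired output
minimum = master['timestamp'].min()
maximum = master['timestamp'].max()

# window index for every row: 0 for the first window, 1 for the next, ...
bins = ((master['timestamp'] - minimum) // window).astype(int)

counts = (pd.get_dummies(master['channel'])
            .groupby(bins)
            .sum()
            .reindex(np.arange(int((maximum - minimum) // window) + 1), fill_value=0))

# optional: label each row with the start time of its window
counts.index = counts.index * window + minimum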

UPDATE

Timing:

In [163]: df = pd.concat([df] * 10**4, ignore_index=True)

In [164]: %timeit pd.get_dummies(df.set_index('time')['letter'])
100 loops, best of 3: 10.9 ms per loop

In [165]: %timeit df.set_index('time').letter.str.get_dummies()
1 loop, best of 3: 914 ms per loop