Group by time, then count unique entries only if they are present in a list [Pandas]

Time: 2017-11-12 14:11:04

Tags: python pandas dataframe

Consider the following pandas DataFrame "df" and Python list "my_list".

df =

timestamp  address    type
1           1          A
2           9          B
3           3          A
4           6          B
5           6          B
6           2          B
7           3          A
8           2          B
9           1          B
10          3          A
11          3          A
12          3          A

my_list =

[1, 2, 3]

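What I want is to group the DataFrame by the timestamp column in 3-second bins and, for each "type", count the number of unique addresses, but only for addresses that are present in "my_list".

The expected output should look like this: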

timestamp   A    B
1           2    0 # One "B" ignored because address=9 is not in my_list
4           0    1 # Two "B" rows ignored because address=6 is not in my_list
7           1    2 # Two "B" rows with unique addresses, and one "A"
10          1    0 # Three rows with type "A", but the addresses are all the same

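Note that the timestamp values are originally in timestamp (datetime) format, so df.groupby together with pd.TimeGrouper (pd.Grouper in later pandas releases) can be used to group the rows into 3-second bins.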

  

Only answers based on Pandas (Python) are appreciated.

Apologies for any confusion. I tried to keep it simple.

- Khan

3 answers:

Answer 0: (score: 4)

Use:

# bin rows into groups of three (assumes the default 0..n-1 integer index)
df.index = df.index // 3
# keep only rows whose address is in my_list
df1 = df[df['address'].isin(my_list)]
# count unique addresses per (bin, type) and pivot the types into columns
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
# label each bin with its first timestamp
df1.index = df['timestamp'].groupby(df.index).first()
print (df1)
type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0

Setup:

print (df)
    timestamp  address type
0           1        1    A
1           2        9    B
2           3        3    A
3           4        6    B
4           5        6    B
5           6        2    B
6           7        3    A
7           8        2    B
8           9        1    B
9          10        3    A
10         11        3    A
11         12        3    A
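
For anyone who wants to reproduce the printed frame, a minimal sketch of how it could be rebuilt is shown here. The values are copied from the question; the constructor layout is just one convenient choice.

```python
import pandas as pd

# Rebuild the example frame from the question (default 0..11 integer index)
df = pd.DataFrame({
    "timestamp": list(range(1, 13)),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

print(df)
```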

The solution is simpler with datetimes:

#sample datetimes 
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
                   origin=pd.Timestamp('2017-01-01'))

print (df)
    timestamp  address type
0  2017-01-02        1    A
1  2017-01-03        9    B
2  2017-01-04        3    A
3  2017-01-05        6    B
4  2017-01-06        6    B
5  2017-01-07        2    B
6  2017-01-08        3    A
7  2017-01-09        2    B
8  2017-01-10        1    B
9  2017-01-11        3    A
10 2017-01-12        3    A
11 2017-01-13        3    A

df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
          .nunique()
          .unstack(fill_value=0) )
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0

One-line solution:

df1 = (df.query("address in @my_list")
         .groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
         .nunique()
         .unstack(fill_value=0))
print (df1)
type        A  B
timestamp       
2017-01-02  2  0
2017-01-05  0  1
2017-01-08  1  2
2017-01-11  1  0
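
The examples above use daily sample timestamps, while the question mentions 3-second bins. A rough sketch of the same idea with seconds follows. It assumes the integers are seconds after an arbitrary start (2017-01-01 here, chosen only for illustration) and that the installed pandas supports the origin argument of pd.Grouper (pandas 1.1 or later, as far as I know). With sub-daily frequencies the bins are otherwise anchored at midnight, so origin='start' is used to anchor them at the first timestamp of the filtered data.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": list(range(1, 13)),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Assumption: the integers are seconds after an arbitrary start time
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s",
                                 origin=pd.Timestamp("2017-01-01"))

result = (
    df[df["address"].isin(my_list)]          # keep addresses in my_list
    .groupby([pd.Grouper(key="timestamp", freq="3s", origin="start"),
              "type"])["address"]
    .nunique()                               # unique addresses per (bin, type)
    .unstack(fill_value=0)                   # types become columns
)
print(result)
```

Because the bins start at the first timestamp that survives the filter, the labels can shift if the earliest rows are filtered out.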

Answer 1: (score: 3)

One approach is to use pivot_table with helper columns:

import numpy as np

# Number the 3-row groups: running count within each timestamp % 3 class
df['temp'] = df.groupby([df['timestamp'] % 3]).cumcount()

# Replace addresses that are absent from my_list with NaN
df['add'] = df['address'].where(df['address'].isin(my_list), np.nan)

# Timestamps whose value mod 3 is 1 become the index of the result
idx = df['timestamp'][df['timestamp'] % 3 == 1]

# Pivot with nunique as the aggregation on the new columns, filling missing cells with 0
ndf = df.pivot_table(values='add', index='type', columns='temp', aggfunc='nunique', fill_value=0).T.set_index(idx)

Output:

type       A  B
timestamp      
1          2  0
4          0  1
7          1  2
10         1  0
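
The helper column built with cumcount only works as a bin number because the sample timestamps are consecutive integers starting at 1. The sketch below is a small check of that assumption, not part of the original answer; the gappy series is a made-up example.

```python
import pandas as pd

ts = pd.Series(range(1, 13), name="timestamp")

# Helper column from the answer: running count within each residue class mod 3
temp = ts.groupby(ts % 3).cumcount()

# Direct bin number for consecutive integers starting at 1
direct = (ts - 1) // 3

print(temp.tolist())    # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(direct.tolist())  # same list

# Hypothetical series with missing timestamps: the two no longer agree
gappy = pd.Series([1, 2, 3, 4, 7, 8])
gappy_temp = gappy.groupby(gappy % 3).cumcount()
gappy_direct = (gappy - 1) // 3
print(gappy_temp.tolist())    # [0, 0, 0, 1, 2, 1]
print(gappy_direct.tolist())  # [0, 0, 0, 1, 2, 2]
```

If the real data can have gaps, computing the bin directly from the timestamp value is probably the safer choice.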

Answer 2: (score: 3)

Use pd.get_dummies

grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
inmy = df.address.isin(my_list).values

pd.get_dummies(df.set_index(grps)[inmy & ~dups].type).groupby(level=0).sum().reset_index()

   timestamp  A  B
0          1  2  0
1          4  0  1
2          7  1  2
3         10  1  0
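
To see why dropping duplicated (address, type, bin) rows turns a plain row count into a unique-address count, the intermediate steps can be spelled out roughly as follows. This is a sketch on the sample data; it uses groupby(...).size() in place of get_dummies for the final count, and the names keyed, kept and counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": list(range(1, 13)),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Bin label: first timestamp of each block of three (1, 4, 7, 10)
grps = df["timestamp"].sub(1).floordiv(3).mul(3).add(1)

# A row is a duplicate if the same (address, type) already appeared in its bin
keyed = df[["address", "type"]].assign(grps=grps)
dups = keyed.duplicated()

# Keep first occurrences whose address is in my_list, then count rows
kept = keyed[df["address"].isin(my_list) & ~dups]
counts = kept.groupby(["grps", "type"]).size().unstack(fill_value=0)
print(counts)
```

After the duplicates are removed, each remaining row stands for one distinct address within its bin and type, so counting rows gives the same numbers as nunique.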