Consider the following pandas DataFrame "df" and Python list "my_list".
df =
timestamp address type
1 1 A
2 9 B
3 3 A
4 6 B
5 6 B
6 2 B
7 3 A
8 2 B
9 1 B
10 3 A
11 3 A
12 3 A
my_list =
[1, 2, 3]
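What I want is to group the DataFrame by the timestamp column in 3-second bins and count the unique occurrences of each "type" (that is, distinct addresses per type), but only where the address is present in "my_list".
The expected output should look like this: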
timestamp A B
1 2 0 #One "B" ignored, because address=9 is not in "my_list"
4 0 1 #Two "B" ignored, because address=6 is not in "my_list"
7 1 2 #Two "B" with unique addresses, and one "A"
10 1 0 #Three rows with type="A", but the addresses are all the same.
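Note that the timestamp values are originally in timestamp format, so df.groupby together with pd.TimeGrouper can be used to group the rows into 3-second bins.
Only Pandas-based (Python) answers would be appreciated.
Apologies for any confusion. I tried to keep it simple.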
- Khan
Answer 0 (score: 4)
Use:
#convert index to triples
df.index = df.index // 3
#filter rows by condition
df1 = df[df['address'].isin(my_list)]
#get unique numbers and reshape
df1 = df1['address'].groupby([df1.index, df1['type']]).nunique().unstack(fill_value=0)
#add timestamps
df1.index = df['timestamp'].groupby(df.index).first()
print (df1)
type A B
timestamp
1 2 0
4 0 1
7 1 2
10 1 0
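The approach above bins rows by their position in the index, which relies on a default 0..n-1 index and on timestamps with no gaps. As a hedged variant (not part of the original answer), the sketch below derives the bin label from the timestamp values themselves. The name bin_start is introduced here for illustration only, and the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Keep only rows whose address is in my_list.
filtered = df[df["address"].isin(my_list)]

# Label each row with the first timestamp of its 3-wide bin:
# 1..3 -> 1, 4..6 -> 4, 7..9 -> 7, 10..12 -> 10.
# (Assumes integer timestamps starting at 1, as in the sample.)
bin_start = (filtered["timestamp"] - 1) // 3 * 3 + 1

# Count distinct addresses per (bin, type) and spread types into columns.
counts = (
    filtered.groupby([bin_start, "type"])["address"]
    .nunique()
    .unstack(fill_value=0)
)
print(counts)
```

Because the bin label is computed from the timestamp rather than the row position, this version does not need the separate step that restores the timestamps afterwards.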
Setup:
print (df)
timestamp address type
0 1 1 A
1 2 9 B
2 3 3 A
3 4 6 B
4 5 6 B
5 6 2 B
6 7 3 A
7 8 2 B
8 9 1 B
9 10 3 A
10 11 3 A
11 12 3 A
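For anyone who wants to reproduce the setup, a minimal sketch that rebuilds this frame is shown here. The values are copied from the printed table; the construction itself is an assumption, since the original post only shows the printed result.

```python
import pandas as pd

# Rebuild the sample frame; values are copied from the printed table.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

print(df)
```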
The solution is simpler with datetimes:
#sample datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='D',
origin=pd.Timestamp('2017-01-01'))
print (df)
timestamp address type
0 2017-01-02 1 A
1 2017-01-03 9 B
2 2017-01-04 3 A
3 2017-01-05 6 B
4 2017-01-06 6 B
5 2017-01-07 2 B
6 2017-01-08 3 A
7 2017-01-09 2 B
8 2017-01-10 1 B
9 2017-01-11 3 A
10 2017-01-12 3 A
11 2017-01-13 3 A
df1 = df[df['address'].isin(my_list)]
df1 = (df1.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
.nunique()
.unstack(fill_value=0) )
print (df1)
type A B
timestamp
2017-01-02 2 0
2017-01-05 0 1
2017-01-08 1 2
2017-01-11 1 0
One-line solution:
df1 = (df.query("address in @my_list")
.groupby([pd.Grouper(freq='3D', key='timestamp'), 'type'])['address']
.nunique()
.unstack(fill_value=0))
print (df1)
type A B
timestamp
2017-01-02 2 0
2017-01-05 0 1
2017-01-08 1 2
2017-01-11 1 0
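The question asks for 3-second bins, while the examples above use days as sample datetimes. The following is a sketch of the same idea with seconds, assuming the integer timestamps are seconds elapsed since an arbitrary start (2017-01-01 here, chosen only for illustration). With sub-daily frequencies pd.Grouper aligns bins to midnight by default, so the origin argument (available since pandas 1.1) is set explicitly to the first timestamp to keep the bins starting at second 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Assumption: the integers are seconds counted from an arbitrary start time.
df["timestamp"] = pd.to_datetime(
    df["timestamp"], unit="s", origin=pd.Timestamp("2017-01-01")
)

# Anchor the 3-second bins at the first timestamp of the unfiltered data,
# so filtering out early rows cannot shift the bin edges.
grouper = pd.Grouper(key="timestamp", freq="3s", origin=df["timestamp"].min())

counts = (
    df[df["address"].isin(my_list)]
    .groupby([grouper, "type"])["address"]
    .nunique()
    .unstack(fill_value=0)
)
print(counts)
```

Note that a bin in which every address is filtered out would not appear in the result at all; reindexing against the full range of bin labels would be one way to add such rows back with zeros.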
Answer 1 (score: 3)
Using pivot_table with helper columns, i.e.:
import numpy as np
# Number the 3-row bins: rows sharing the same timestamp % 3 get a running count (rows 1-3 -> 0, 4-6 -> 1, ...)
df['temp'] = df.groupby([df['timestamp']%3]).cumcount()
# Replace addresses that are not in my_list with NaN so that nunique ignores them
df['add'] = df['address'].where(df['address'].isin(my_list),np.nan)
# Timestamps that start each bin (those whose mod value is 1)
idx = df['timestamp'][df['timestamp']%3==1]
# Pivot table with the number of unique values as the aggregation over the new columns; fill NaN with 0
ndf = df.pivot_table('add','type','temp',aggfunc='nunique',fill_value=0).T.set_index(idx)
Output:
type A B
timestamp
1 2 0
4 0 1
7 1 2
10 1 0
Answer 2 (score: 3)
Using pd.get_dummies
grps = df.timestamp.sub(1).floordiv(3).mul(3).add(1)
dups = df[['address', 'type']].assign(grps=grps).duplicated().values
inmy = df.address.isin(my_list).values
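# sum(level=0) was removed in pandas 2.0; grouping by the index level gives the same result
pd.get_dummies(df.set_index(grps)[inmy & ~dups].type).groupby(level=0).sum().reset_index()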
timestamp A B
0 1 2 0
1 4 0 1
2 7 1 2
3 10 1 0
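The idea in this answer is to drop repeated (bin, address, type) combinations first and then simply count rows. A hedged alternative expressing the same idea with drop_duplicates and pd.crosstab is sketched below. This swaps in a different technique from the get_dummies call above, and the names filtered, bins, and unique_rows are introduced for illustration only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "timestamp": range(1, 13),
    "address": [1, 9, 3, 6, 6, 2, 3, 2, 1, 3, 3, 3],
    "type": list("ABABBBABBAAA"),
})
my_list = [1, 2, 3]

# Keep only rows whose address is in my_list.
filtered = df[df["address"].isin(my_list)]

# First timestamp of each 3-wide bin (same arithmetic as grps above).
bins = filtered["timestamp"].sub(1).floordiv(3).mul(3).add(1)

# Replace each timestamp with its bin label, then keep one row per
# (bin, address, type) combination.
unique_rows = (
    filtered.assign(timestamp=bins)
    .drop_duplicates(["timestamp", "address", "type"])
)

# After deduplication, a plain frequency table gives the unique counts.
counts = pd.crosstab(unique_rows["timestamp"], unique_rows["type"])
print(counts.reset_index())
```

pd.crosstab fills missing (bin, type) combinations with 0 on its own, so no explicit fill_value is needed here.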