Question

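Basically, I have some data from a number of routers (APs). The routers probe users' devices every 3 seconds and give us each device's MAC address (tag_mac).

To clean this data (within a given period, different APs return the same tag_mac if the user is close to several APs), I only need, for each tag_mac, the AP with the strongest signal (indicated by rssi) in every 10 seconds (just take the average). Here is a sample of my data.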


         ap_mac  rssi       tag_mac                time
0  048b422149fa   -63  a40dbc018db7 2017-07-01 08:00:00
1  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
2  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
3  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
4  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
5  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
6  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
7  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
8  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
9  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00

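What I need is a filtered dataframe in which, within each 10-second period, all rows with a weaker rssi have been dropped. What I am left with is cleaned data where, for each tag_mac, I only have the ap_mac with the strongest rssi.

Can anyone help me with this? Thanks!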

Answer 1

I assume df is the DataFrame.
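To try the code below on the question's sample, df could be built along these lines. This is only a sketch: the values are copied from the sample table, and the helper names readings and rows are made up for illustration.

```python
import pandas as pd

# (ap_mac, rssi, tag_mac, number of identical rows), copied from the sample table
readings = [
    ('048b422149fa', -63, 'a40dbc018db7', 1),
    ('048b4223e63d', -72, 'a40dbc018db7', 4),
    ('048b422149ff', -50, '30b49e3715d0', 5),
]

# expand each reading into its repeated rows; all sample rows share one timestamp
rows = [
    {'ap_mac': ap, 'rssi': rssi, 'tag_mac': tag, 'time': '2017-07-01 08:00:00'}
    for ap, rssi, tag, count in readings
    for _ in range(count)
]

df = pd.DataFrame(rows, columns=['ap_mac', 'rssi', 'tag_mac', 'time'])
print(df)
```

The time column is left as strings here on purpose, since the code that follows converts it with pd.to_datetime.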

import pandas as pd

# make sure the 'time' column holds datetimes in the expected format
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')

# start is the earliest timestamp in 'df', end is the latest
start = df['time'].min()
end = df['time'].max()

step = pd.Timedelta(seconds=10)
rows = []
lower = start

# walk over consecutive 10-second windows [lower, upper), including the last partial one
while lower <= end:
    upper = lower + step

    # rows whose time falls inside the current window
    data = df[(df['time'] >= lower) & (df['time'] < upper)]

    for tag in data['tag_mac'].unique():
        # average rssi per ap_mac for this tag_mac
        var = data.loc[data['tag_mac'] == tag].groupby('ap_mac')['rssi'].mean()
        # keep the ap_mac with the highest average rssi
        rows.append({'ap_mac': var.idxmax(), 'tag_mac': tag, 'rssi': var.max(),
                     'to': upper, 'from': lower})

    lower = upper

new_df = pd.DataFrame(rows, columns=['ap_mac', 'tag_mac', 'rssi', 'to', 'from'])

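As you mentioned, your huge dataset will be condensed into a DataFrame new_df that contains only the values you need.

I have added the new columns to and from to new_df; they show the time range that the readings fall in.

new_df contains every tag_mac together with the ap_mac that has the highest average rssi in each 10-second window.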
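The question also asks for the original rows with the weaker readings dropped, rather than a summary table. A minimal sketch of one way to get that with groupby and transform is shown here. The timestamps are made up for illustration, the helper columns window and avg_rssi are hypothetical names, and the windows are aligned to the clock (multiples of 10 seconds) rather than to the first timestamp.

```python
import pandas as pd

# small illustrative frame; the timestamps are invented for this sketch
df = pd.DataFrame({
    'ap_mac': ['048b422149fa', '048b4223e63d', '048b4223e63d', '048b422149ff'],
    'rssi': [-63, -72, -72, -50],
    'tag_mac': ['a40dbc018db7', 'a40dbc018db7', 'a40dbc018db7', '30b49e3715d0'],
    'time': pd.to_datetime(['2017-07-01 08:00:00', '2017-07-01 08:00:03',
                            '2017-07-01 08:00:06', '2017-07-01 08:00:00']),
})

# label each row with the start of its 10-second window
df['window'] = df['time'].dt.floor('10s')

# average rssi per (window, tag_mac, ap_mac), broadcast back onto every row
df['avg_rssi'] = df.groupby(['window', 'tag_mac', 'ap_mac'])['rssi'].transform('mean')

# strongest average within each (window, tag_mac)
best = df.groupby(['window', 'tag_mac'])['avg_rssi'].transform('max')

# keep only the rows of the strongest AP; tied APs are all kept
filtered = df[df['avg_rssi'] == best].drop(columns=['window', 'avg_rssi'])
print(filtered)
```

For a40dbc018db7 only the row from 048b422149fa survives, because its average of -63 beats the -72 reported by 048b4223e63d in the same window.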

If you run into any difficulties, feel free to comment.

Answer 2

I don't know if I understood your question, but you can use the pandas Grouper like this:

import pandas as pd

df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
result = df.groupby([pd.Grouper(freq='10s'),'ap_mac','tag_mac']).mean().reset_index()
result.groupby(['time','tag_mac'])[['ap_mac','rssi']].max()

Edit:

I modified your table just to see how the code works:

         ap_mac  rssi       tag_mac                time
0  048b422149fa   -63  a40dbc018db7 2017-07-01 08:00:00
1  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:10
2  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:15
3  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
4  048b4223e63d   -72  a40dbc018db7 2017-07-01 08:00:00
5  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
6  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:30
7  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:12
8  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00
9  048b422149ff   -50  30b49e3715d0 2017-07-01 08:00:00

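You want to group by time (every 10 seconds), ap_mac and tag_mac.

First, convert the time column to datetime with pd.to_datetime:

df['time'] = pd.to_datetime(df['time'])

To use a time-based Grouper, set time as the index (it only works with a DatetimeIndex):

df = df.set_index('time')

Then perform the groupby to get the mean rssi for each tag_mac at each ap_mac in every 10-second bin:

result = df.groupby([pd.Grouper(freq='10s'),'ap_mac','tag_mac']).mean().reset_index()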
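To make the binning step more concrete, here is a small sketch of what a 10-second Grouper does to a DatetimeIndex. The rssi values and timestamps are invented for illustration. With the default settings the bins are aligned to the clock, so grouping by the Grouper gives the same means as grouping by the floored timestamps; a Grouper used on its own may also emit empty bins, which is why the comparison drops NaN rows.

```python
import pandas as pd

# invented readings indexed by time
times = pd.to_datetime(['2017-07-01 08:00:00', '2017-07-01 08:00:12',
                        '2017-07-01 08:00:15', '2017-07-01 08:00:30'])
s = pd.Series([-63, -72, -70, -50], index=times, name='rssi')

# mean per 10-second bin via Grouper (requires a DatetimeIndex)
grouped = s.groupby(pd.Grouper(freq='10s')).mean()

# the same bins written explicitly: floor every timestamp to a multiple of 10 s
floored = s.groupby(s.index.floor('10s')).mean()

print(grouped.dropna())
print(floored)
```

The readings at 08:00:12 and 08:00:15 both land in the bin labelled 08:00:10 and are averaged together.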

Finally,

result.groupby(['time','tag_mac'])[['ap_mac', 'rssi']].max()

Output:

                                        ap_mac          rssi
time                    tag_mac         
2017-07-01 08:00:00     30b49e3715d0    048b422149ff    -50
                        a40dbc018db7    048b4223e63d    -63
2017-07-01 08:00:10     30b49e3715d0    048b422149ff    -50
                        a40dbc018db7    048b4223e63d    -72
2017-07-01 08:00:30     30b49e3715d0    048b422149ff    -50
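One thing to watch in this output: .max() is applied to each column independently, so the largest ap_mac string is paired with the largest rssi even when they come from different rows. That is why a40dbc018db7 shows 048b4223e63d with -63 at 08:00:00, although in the table -63 was reported by 048b422149fa. A hedged sketch of a variant that keeps each ap_mac with its own rssi uses idxmax to pick whole rows. The data is a reduced, made-up version of the table, and best_rows and strongest are illustrative names.

```python
import pandas as pd

# reduced, illustrative version of the modified table
df = pd.DataFrame({
    'ap_mac': ['048b422149fa', '048b4223e63d', '048b4223e63d', '048b422149ff'],
    'rssi': [-63, -72, -72, -50],
    'tag_mac': ['a40dbc018db7', 'a40dbc018db7', 'a40dbc018db7', '30b49e3715d0'],
    'time': pd.to_datetime(['2017-07-01 08:00:00', '2017-07-01 08:00:00',
                            '2017-07-01 08:00:10', '2017-07-01 08:00:00']),
}).set_index('time')

# mean rssi per 10-second bin, ap_mac and tag_mac
result = (df.groupby([pd.Grouper(freq='10s'), 'ap_mac', 'tag_mac'])['rssi']
            .mean()
            .reset_index())

# row label of the strongest mean rssi within each (time, tag_mac) group
best_rows = result.groupby(['time', 'tag_mac'])['rssi'].idxmax()

# selecting whole rows keeps every ap_mac paired with its own rssi
strongest = result.loc[best_rows].reset_index(drop=True)
print(strongest)
```

Here a40dbc018db7 at 08:00:00 is matched with 048b422149fa and -63, which is the AP that actually reported the strongest signal. If two APs tie, idxmax keeps only the first one.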

Pandas: how to group by a time period and then get back a df after filtering within each group?
