Question

I have a dataset of user sessions, loaded into a Pandas DataFrame:

SessionID, UserID, Logon_time, Logoff_time
Adx1YiRyvOFApQiniyPWYPo,AbO6vW58ta1Bgrqs.RA0uHg,2016-01-05 07:46:56.180,2016-01-05 08:04:36.057
AfjMzw8In8RDqK6jIfItZPs,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 14:48:47.183,2016-01-04 14:53:30.210
AYIdSJYsRw5PptkFfEOXPa0,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 11:06:37.040,2016-01-04 16:34:38.770
Ac.WXBBSl75KqEuBmNljYPE,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:58:04.227,2016-01-04 11:21:10.520
AekXRDR3mBBDh49IIN2HdU8,Ae8qOxLzozJHrC2pr2dOw88,2016-01-04 10:16:08.040,2016-01-04 10:34:20.523
AVvL3VSWSq5Fr.f4733X.T4,AX3Xy8dRDBRAlhyy3YaWw6U,2016-01-04 09:19:29.773,2016-01-04 09:40:25.157

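What I want to do is transform this data into a DataFrame with two columns:

Timestamp/period (e.g. at minute resolution)
Number of sessions active at that time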

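I can do this for a single timestamp by converting each datetime range to an Interval and then counting the rows whose interval contains the given timestamp.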
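The single-timestamp check described above could look roughly like the sketch below. It uses three of the sample rows; the helper name sessions_at and the choice of intervals closed on both ends are assumptions made for illustration, not something stated in the question.

```python
import pandas as pd

# Three of the sample sessions from the question (time columns only).
df = pd.DataFrame({
    "Logon_time": pd.to_datetime([
        "2016-01-04 11:06:37.040",
        "2016-01-04 10:58:04.227",
        "2016-01-04 10:16:08.040",
    ]),
    "Logoff_time": pd.to_datetime([
        "2016-01-04 16:34:38.770",
        "2016-01-04 11:21:10.520",
        "2016-01-04 10:34:20.523",
    ]),
})

# One pd.Interval per session. closed="both" is an assumption: it treats the
# logon and logoff instants themselves as part of the session.
intervals = [
    pd.Interval(row.Logon_time, row.Logoff_time, closed="both")
    for row in df.itertuples(index=False)
]


def sessions_at(ts):
    """Count the sessions whose interval contains the timestamp ts (hypothetical helper)."""
    return sum(ts in iv for iv in intervals)


n_open = sessions_at(pd.Timestamp("2016-01-04 11:10"))
```

Each call scans every session once, which is why repeating it for every reference timestamp gets expensive.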

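However, if I want to do this for a year or two at minute or hourly resolution, I end up with 8760 loop iterations per year (in the hourly case). That may not be a deal breaker, but I'm wondering whether anyone has other (possibly more elegant) suggestions or ideas.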
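For reference, the loop being described might be sketched as follows, here at hourly resolution over the sample day. The variable names are illustrative, and a boolean mask stands in for the Interval check; the last line only confirms the 8760 figure for a non-leap year.

```python
import pandas as pd

# Three of the sample sessions from the question (time columns only).
df = pd.DataFrame({
    "Logon_time": pd.to_datetime([
        "2016-01-04 11:06:37.040",
        "2016-01-04 10:58:04.227",
        "2016-01-04 10:16:08.040",
    ]),
    "Logoff_time": pd.to_datetime([
        "2016-01-04 16:34:38.770",
        "2016-01-04 11:21:10.520",
        "2016-01-04 10:34:20.523",
    ]),
})

one_hour = pd.Timedelta(hours=1)

# Hourly reference timestamps covering the sample day.
hours = pd.date_range("2016-01-04 00:00", "2016-01-04 23:00", freq=one_hour)

# One pass per reference timestamp; each pass compares against every session.
counts = pd.Series(
    [int(((df["Logon_time"] <= ts) & (ts <= df["Logoff_time"])).sum()) for ts in hours],
    index=hours,
    name="sessions",
)

# A full non-leap year at hourly resolution: 365 * 24 reference timestamps.
hours_in_2015 = len(pd.date_range("2015-01-01 00:00", "2015-12-31 23:00", freq=one_hour))
```

The work grows with the number of reference timestamps times the number of sessions, so minute resolution over two years multiplies the pass count considerably.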

Answer 1

IIUC, we can do it this way:

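# For each session, build a Series of 1s indexed by every minute step between logon and logoff,
# then stack all sessions into one long Series and count the entries per minute.
df.apply(lambda x: pd.Series(1, index=pd.date_range(x.Logon_time, x.Logoff_time, freq='min')),
         axis=1)\
  .stack().reset_index(level=0, drop=True).resample('min').count().plot()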
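To inspect the numbers rather than a plot, the same expand-and-count idea can be run on the six sample rows from the question. The sketch below swaps apply()/stack() for a list comprehension and pd.concat, which is a substitution made here for readability and not the answer's exact code; the variable names are illustrative.

```python
import pandas as pd

# The six sample sessions from the question (time columns only).
df = pd.DataFrame({
    "Logon_time": pd.to_datetime([
        "2016-01-05 07:46:56.180", "2016-01-04 14:48:47.183",
        "2016-01-04 11:06:37.040", "2016-01-04 10:58:04.227",
        "2016-01-04 10:16:08.040", "2016-01-04 09:19:29.773",
    ]),
    "Logoff_time": pd.to_datetime([
        "2016-01-05 08:04:36.057", "2016-01-04 14:53:30.210",
        "2016-01-04 16:34:38.770", "2016-01-04 11:21:10.520",
        "2016-01-04 10:34:20.523", "2016-01-04 09:40:25.157",
    ]),
})

# One Series of 1s per session, indexed by each one-minute step from logon up to logoff.
per_session = [
    pd.Series(1, index=pd.date_range(row.Logon_time, row.Logoff_time, freq="min"))
    for row in df.itertuples(index=False)
]

# Combine all sessions into one long Series, then count the entries in each minute bin.
counts = pd.concat(per_session).sort_index().resample("min").count()

busiest = counts.max()
```

Because each range starts at the exact logon instant, a session's final partial minute may not produce an entry; flooring the start and ceiling the end to the minute avoids that.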

The plot() call that ends the apply() chain lets us visually inspect all the data with pandas.

Answer 2

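The solution I ended up using differs slightly from Scott's answer, but his approach was key: the number of observations (records) is relatively small, whereas the number of time elements is much larger (e.g. the number of seconds elapsed between the first and last observation, depending on the required resolution).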

However, I first collect all the generated date ranges (Series) in a list and concatenate them in a second, separate step, which is faster than using apply() to keep modifying the original DataFrame.

The code is below. Once it has run, plotting takes only one additional statement, such as df_sessions_2.plot().

import pandas as pd

# Assumes df_sessions holds the session data with datetime columns 'logon' and 'logoff'
# (presumably the Logon_time / Logoff_time columns from the question, renamed and parsed).

# Expand the datetime range, creating records according to the given resolution (e.g. minutes).
# This creates a Series object for each session. All of those Series objects are then added to a list
# in order to concatenate them in one go, which is more efficient.
sessions = []

for key, cols in df_sessions.iterrows():
    sess = pd.Series(data=pd.date_range(start=cols['logon'].floor('min'),
                                        end=cols['logoff'].ceil('min'),
                                        freq='min'),
                     name='ref_dt')
    sessions.append(sess)

# Concatenate all Series objects and convert to a DataFrame
# (pd.concat is used because Series.append was removed in pandas 2.0)
df_sessions_2 = pd.concat(sessions, ignore_index=True).to_frame(name='ref_dt')

# Add a counter which we can use to aggregate
df_sessions_2['sess_cnt'] = 1

# Aggregate by timestamp: the sum is the number of sessions open in that minute
df_sessions_2 = df_sessions_2.groupby('ref_dt').sum()
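One thing to keep in mind is that the aggregated frame only has rows for minutes in which at least one session was open. If a gap-free timeline is wanted (for instance before plotting), the index can be expanded and the missing minutes filled with 0. The sketch below uses a small hypothetical stand-in for the aggregated frame; the values are made up for illustration and are not results from the question's data.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated frame: minute -> number of open sessions.
# Minutes without any session are simply absent.
counts = pd.DataFrame(
    {"sess_cnt": [1, 2, 1]},
    index=pd.DatetimeIndex(
        ["2016-01-04 10:34", "2016-01-04 10:35", "2016-01-04 10:38"],
        name="ref_dt",
    ),
)

# Re-sample the index onto a continuous one-minute grid; minutes that were
# missing are inserted with a count of 0.
full = counts.asfreq("min", fill_value=0)
```

Whether zero-filling is appropriate depends on the use case; for a line plot it keeps idle periods from being drawn as a straight line between two busy minutes.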

Counting the number of user sessions, defined as intervals
