Python - Vectorized conditional count of a variable based on date differences over a 1-million-row table

Time: 2018-06-10 17:49:22

Tags: python pandas

I have the following pandas DataFrame:

Date                         Variable
2018-04-10 21:05:00             a
2018-04-10 21:05:00             a
2018-04-10 21:10:00             b
2018-04-10 21:15:00             a
2018-04-10 21:35:00             b
2018-04-10 21:45:00             a
2018-04-10 21:45:00             a

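My goal is to count, for each row, the number of 'a' rows in the 30 minutes before and in the 30 minutes after that row's time (rows with the same timestamp count towards both the before and the after totals, but the row being analyzed is itself excluded). I then want to do the same for every other Variable. So for Variable a I would end up with the following: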

Date                   nr_30_min_bef_a    nr_30_min_after_a   
2018-04-10 21:05:00           1                    2                             
2018-04-10 21:05:00           1                    2
2018-04-10 21:10:00           2                    1
2018-04-10 21:15:00           2                    2
2018-04-10 21:35:00           3                    2
2018-04-10 21:45:00           2                    1
2018-04-10 21:45:00           2                    1

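I tried a for loop that iterates over all the rows. The problem is that the whole series has more than one million rows, so I am looking for a more efficient solution.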

import pandas as pd

df = pd.DataFrame({'Date': ['2018-04-10 21:05:00',
                            '2018-04-10 21:05:00',
                            '2018-04-10 21:10:00',
                            '2018-04-10 21:15:00',
                            '2018-04-10 21:35:00',
                            '2018-04-10 21:45:00',
                            '2018-04-10 21:45:00'],
                   'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a']})
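
For reference, a row-by-row loop of the kind described above might look like the sketch below. The helper name count_neighbors_loop and its window parameter are illustrative assumptions, not the code that was actually run. It compares every row against every other row, so the work grows quadratically with the number of rows, which is what makes it impractical at a million rows. It is still handy for checking a faster solution on small samples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-04-10 21:05:00', '2018-04-10 21:05:00',
                            '2018-04-10 21:10:00', '2018-04-10 21:15:00',
                            '2018-04-10 21:35:00', '2018-04-10 21:45:00',
                            '2018-04-10 21:45:00']),
    'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a'],
})


def count_neighbors_loop(df, var, window=pd.Timedelta('30min')):
    """Hypothetical brute-force reference: O(n^2), for small inputs only."""
    dates = df['Date']
    is_var = df['Variable'] == var
    positions = np.arange(len(df))
    before, after = [], []
    for pos in positions:
        t = dates.iloc[pos]
        # Exclude only the row being analyzed, not other rows at the same time.
        not_self = positions != pos
        in_before = (dates >= t - window) & (dates <= t)
        in_after = (dates >= t) & (dates <= t + window)
        before.append(int((in_before & is_var & not_self).sum()))
        after.append(int((in_after & is_var & not_self).sum()))
    return before, after


before_a, after_a = count_neighbors_loop(df, 'a')
print(before_a)  # [1, 1, 2, 2, 3, 2, 2]
print(after_a)   # [2, 2, 1, 2, 2, 1, 1]
```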

Thanks in advance.

1 answer:

Answer 0: (score: 2)

Building on this previous answer, you could use

import pandas as pd

df = pd.DataFrame({'Date': ['2018-04-10 21:05:00',
                            '2018-04-10 21:05:00',
                            '2018-04-10 21:10:00',
                            '2018-04-10 21:15:00',
                            '2018-04-10 21:35:00',
                            '2018-04-10 21:45:00',
                            '2018-04-10 21:45:00'],
                   'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a']})

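# Parse the strings into timestamps so that time-based rolling windows work.
df['Date'] = pd.to_datetime(df['Date'])

# Frequency table: one row per distinct timestamp, one column per Variable.
freq_table = pd.crosstab(index=df['Date'], columns=df['Variable'])
# Count the occurrences inside the closed window [t - 30 min, t].
df_bef = freq_table.rolling('30min', closed='both').sum().astype(int)
# Subtract 1 wherever the Variable occurs at t, so the current row is not counted.
is_current = (freq_table != 0).astype(int)
df_bef -= is_current
df_bef.columns = ['nr_30_min_bef_{}'.format(col) for col in df_bef.columns]
result = pd.merge(df, df_bef, left_on='Date', right_index=True)

# For the "after" counts, mirror the time axis: a backward-looking window on
# the mirrored dates covers [t, t + 30 min] on the real dates.
max_date = df['Date'].max()
min_date = df['Date'].min()
pseudo_dates = (max_date - df['Date'])[::-1] + min_date
freq_table_reversed = pd.crosstab(index=pseudo_dates, columns=df['Variable'])
df_after = freq_table_reversed.rolling('30min', closed='both').sum().astype(int)
# Flip the rows back into the original chronological order.
df_after = pd.DataFrame(df_after.values[::-1], index=freq_table.index,
                        columns=df_after.columns)
df_after -= is_current
df_after.columns = ['nr_30_min_after_{}'.format(col) for col in df_after.columns]

result = pd.merge(result, df_after, left_on='Date', right_index=True)
print(result)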

This prints

                 Date Variable  nr_30_min_bef_a  nr_30_min_bef_b  nr_30_min_after_a  nr_30_min_after_b
0 2018-04-10 21:05:00        a                1                0                  2                  2
1 2018-04-10 21:05:00        a                1                0                  2                  2
2 2018-04-10 21:10:00        b                2                0                  1                  1
3 2018-04-10 21:15:00        a                2                1                  2                  1
4 2018-04-10 21:35:00        b                3                1                  2                  0
5 2018-04-10 21:45:00        a                2                1                  1                  0
6 2018-04-10 21:45:00        a                2                1                  1                  0

The main new idea is to use pd.crosstab to generate a frequency table:

freq_table = pd.crosstab(index=df['Date'], columns=df['Variable'])
# Variable             a  b
# Date                     
# 2018-04-10 21:05:00  2  0
# 2018-04-10 21:10:00  0  1
# 2018-04-10 21:15:00  1  0
# 2018-04-10 21:35:00  0  1
# 2018-04-10 21:45:00  2  0

Then sum the numbers in each rolling window:

df_bef = freq_table.rolling('30min', closed='both').sum().astype(int)

Since you want to exclude the current row from the count, is_current is subtracted from df_bef.
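
To see why closed='both' matters, here is a small sketch on the sample data. It compares the default window, which excludes the left endpoint, with the closed window used above, and then maps the per-timestamp counts back onto the original rows. The names right_closed, both_closed and per_row are illustrative choices, and reindex is used here only as one possible alternative to the merge shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-04-10 21:05:00', '2018-04-10 21:05:00',
                            '2018-04-10 21:10:00', '2018-04-10 21:15:00',
                            '2018-04-10 21:35:00', '2018-04-10 21:45:00',
                            '2018-04-10 21:45:00']),
    'Variable': ['a', 'a', 'b', 'a', 'b', 'a', 'a'],
})

freq_table = pd.crosstab(index=df['Date'], columns=df['Variable'])

# Default for offset windows is closed='right': the window is (t - 30 min, t].
right_closed = freq_table.rolling('30min').sum().astype(int)
# closed='both' also keeps rows exactly 30 minutes old: [t - 30 min, t].
both_closed = freq_table.rolling('30min', closed='both').sum().astype(int)

t = pd.Timestamp('2018-04-10 21:35:00')
# The two 'a' rows at 21:05 are exactly 30 minutes before 21:35.
print(right_closed.loc[t, 'a'])  # 1 (only the 21:15 row)
print(both_closed.loc[t, 'a'])   # 3 (21:05 twice, plus 21:15)

# Remove the current row from the count, then expand back to one value per row.
is_current = (freq_table != 0).astype(int)
df_bef = both_closed - is_current
per_row = df_bef.reindex(df['Date'])
print(per_row['a'].tolist())  # [1, 1, 2, 2, 3, 2, 2]
```

The last list matches the nr_30_min_bef_a column in the output above.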