How do I handle this complex logic in python pandas?

Asked: 2016-08-12 05:19:15

Tags: python-2.7 pandas dataframe

I have some data with the following structure. It is stored in a python pandas DataFrame that I named df.

Data1,Data2,Flag
2016-04-29,00:40:15,1
2016-04-29,00:40:24,2
2016-04-29,00:40:35,2
2015-04-29,00:40:36,2
2015-04-29,00:40:43,2
2015-04-29,00:40:45,2
2015-04-29,00:40:55,1
2015-04-29,00:41:05,1
2015-04-29,00:41:16,1
2015-04-29,00:41:17,2
.....................
.....................
2016-11-29,11:52:36,2
2016-11-29,11:52:43,2
2016-11-29,11:52:45,2
2016-11-29,11:52:55,1

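I want to filter the data according to the following requirements.

  1. The first row has the timestamp 2016-04-29,00:40:15. The next row I want is the first one in the DataFrame that is more than 18 seconds later than the reference row (the row kept before it). That would give 2016-04-29,00:40:35,2 as the second row and 2015-04-29,00:40:55,1 as the third.
  2. If the Flag of the next row differs from the Flag of the reference row, I keep that row regardless of whether 18 seconds have passed.
  3. Applying both requirements above, I would get the following data: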

    Data1,Data2,Flag
    2016-04-29,00:40:15,1
    2016-04-29,00:40:24,2
    2015-04-29,00:40:43,2
    2015-04-29,00:40:55,1
    2015-04-29,00:41:16,1
    2015-04-29,00:41:17,2
    .....................
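
The two rules can be traced by hand on the first ten sample rows. The sketch below is only an illustration of the requirements, not code from the question: the name `select_rows` is made up, `Data2` is written as seconds since midnight, and `Data1` is ignored on the assumption that only the time of day matters (the expected output above only makes sense that way, since the dates jump from 2016 to 2015).

```python
def select_rows(seconds, flags, gap=18):
    """Return the positions of the rows kept under the two rules.

    seconds: time of day of each row, in seconds (assumption: dates ignored)
    flags:   Flag value of each row
    gap:     a row is kept when it is MORE than `gap` seconds after the
             last kept row (strict comparison is an assumption)
    """
    kept = []
    for pos, (t, f) in enumerate(zip(seconds, flags)):
        if not kept:
            kept.append(pos)  # the first row is always kept
            continue
        ref_t, ref_f = seconds[kept[-1]], flags[kept[-1]]
        # rule 1: enough time has passed; rule 2: the Flag changed
        if t - ref_t > gap or f != ref_f:
            kept.append(pos)
    return kept


# First ten sample rows: 00:40:15 -> 2415 s, 00:40:24 -> 2424 s, ...
seconds = [2415, 2424, 2435, 2436, 2443, 2445, 2455, 2465, 2476, 2477]
flags = [1, 2, 2, 2, 2, 2, 1, 1, 1, 2]
print(select_rows(seconds, flags))  # [0, 1, 4, 6, 8, 9]
```

Positions 0, 1, 4, 6, 8 and 9 correspond to the six rows listed in the expected output.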
    

2 answers:

Answer 0 (score: 2)

I built a generator that yields the rows to keep, then combined them with pd.concat.

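import pandas as pd

def get_row(df):
    # Data2 must already be a timedelta column, e.g.
    # df['Data2'] = pd.to_timedelta(df['Data2'])
    ref = None  # last row that was kept
    for i, row in df.iterrows():
        if ref is not None:
            # rule 1: more than 18 seconds after the last kept row
            cond1 = (row.Data2.total_seconds() -
                     ref.Data2.total_seconds() > 18)
            # rule 2: Flag differs from the last kept row
            cond2 = row.Flag != ref.Flag
        # the first row is always kept; `or` short-circuits, so cond1 and
        # cond2 are never read before they are assigned
        if ref is None or cond1 or cond2:
            yield row
            ref = row

# each yielded row is a Series; stack them as columns, then transpose
pd.concat([r for r in get_row(df)], axis=1).T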
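
The generator calls `.total_seconds()` on `Data2`, so it presumably expects that column to hold timedeltas already. The following is a minimal sketch under that assumption, not the answerer's code: the helper name `kept_labels`, the `gap_seconds` parameter and the six-row frame are made up for illustration. It yields index labels rather than whole rows, so the result can be selected with `df.loc`.

```python
import pandas as pd

# Hypothetical six-row excerpt of the sample data
df = pd.DataFrame({
    "Data1": ["2016-04-29"] * 3 + ["2015-04-29"] * 3,
    "Data2": ["00:40:15", "00:40:24", "00:40:35",
              "00:40:36", "00:40:43", "00:40:45"],
    "Flag": [1, 2, 2, 2, 2, 2],
})
# Convert the time strings so that rows can be subtracted from each other
df["Data2"] = pd.to_timedelta(df["Data2"])


def kept_labels(df, gap_seconds=18):
    """Yield the index label of every row that passes rule 1 or rule 2."""
    ref = None  # last row that was kept
    for label, row in df.iterrows():
        if (ref is None
                or (row.Data2 - ref.Data2).total_seconds() > gap_seconds
                or row.Flag != ref.Flag):
            yield label
            ref = row


result = df.loc[list(kept_labels(df))]
print(result.index.tolist())  # [0, 1, 4]
```

Selecting by label keeps the original column dtypes, whereas transposing a concatenation of row Series produces object columns.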


Timing

Because @Kartik insisted :-)

[Timing screenshot not preserved]

Answer 1 (score: 2)

Here, try this:

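# Data2 must be a timedelta so that times can be compared and shifted
df['Data2'] = pd.to_timedelta(df['Data2'])

tdf = df.copy()  # rows still eligible for selection
sel_idx = []     # index labels of the rows kept so far
while len(tdf) > 0:
    # the first remaining row is always the next one to keep
    sel_idx.extend([tdf.index[0]])
    # rule 1: more than 18 seconds after the row just kept
    cond1 = tdf['Data2'] > tdf.loc[sel_idx[-1], 'Data2'] + pd.to_timedelta(18, 's')
    # rule 2: a different Flag at any later time
    cond2 = (tdf['Flag'] != tdf.loc[sel_idx[-1], 'Flag']) & (tdf['Data2'] > tdf.loc[sel_idx[-1], 'Data2'])
    # drop every row that satisfies neither rule
    tdf = tdf[cond1 | cond2]
df.loc[sel_idx, :]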

Test

Code:

import pandas as pd
from io import StringIO

data = StringIO("""Data1,Data2,Flag
2016-04-29,00:40:15,1
2016-04-29,00:40:24,2
2016-04-29,00:40:35,2
2015-04-29,00:40:36,2
2015-04-29,00:40:43,2
2015-04-29,00:40:45,2
2015-04-29,00:40:55,1
2015-04-29,00:41:05,1
2015-04-29,00:41:16,1
2015-04-29,00:41:17,2
2016-11-29,11:52:36,2
2016-11-29,11:52:43,2
2016-11-29,11:52:45,2
2016-11-29,11:52:55,1""")

df = pd.read_csv(data)
df['Data2'] = pd.to_timedelta(df['Data2'])
print("Input\n", df)

tdf = df.copy()
sel_idx = []
while len(tdf) > 0:
    sel_idx.extend([tdf.index[0]])
    cond1 = tdf['Data2'] > tdf.loc[sel_idx[-1], 'Data2'] + pd.to_timedelta(18, 's')
    cond2 = (tdf['Flag'] != tdf.loc[sel_idx[-1], 'Flag']) & (tdf['Data2'] > tdf.loc[sel_idx[-1], 'Data2'])
    tdf = tdf[cond1 | cond2]
print("Output\n", df.loc[sel_idx, :])

Output:

Input
    Data1       Data2       Flag
0   2016-04-29  00:40:15    1
1   2016-04-29  00:40:24    2
2   2016-04-29  00:40:35    2
3   2015-04-29  00:40:36    2
4   2015-04-29  00:40:43    2
5   2015-04-29  00:40:45    2
6   2015-04-29  00:40:55    1
7   2015-04-29  00:41:05    1
8   2015-04-29  00:41:16    1
9   2015-04-29  00:41:17    2
10  2016-11-29  11:52:36    2
11  2016-11-29  11:52:43    2
12  2016-11-29  11:52:45    2
13  2016-11-29  11:52:55    1

Output
    Data1       Data2       Flag
0   2016-04-29  00:40:15    1
1   2016-04-29  00:40:24    2
4   2015-04-29  00:40:43    2
6   2015-04-29  00:40:55    1
8   2015-04-29  00:41:16    1
9   2015-04-29  00:41:17    2
10  2016-11-29  11:52:36    2
13  2016-11-29  11:52:55    1
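
As a cross-check, the two approaches can be run side by side on the same sample. This is a sketch rather than either answerer's code: the function names `sequential` and `filtering` and the `gap` parameter are invented here, and both bodies are condensed restatements of the answers above.

```python
import pandas as pd
from io import StringIO

csv = StringIO("""Data1,Data2,Flag
2016-04-29,00:40:15,1
2016-04-29,00:40:24,2
2016-04-29,00:40:35,2
2015-04-29,00:40:36,2
2015-04-29,00:40:43,2
2015-04-29,00:40:45,2
2015-04-29,00:40:55,1
2015-04-29,00:41:05,1
2015-04-29,00:41:16,1
2015-04-29,00:41:17,2
2016-11-29,11:52:36,2
2016-11-29,11:52:43,2
2016-11-29,11:52:45,2
2016-11-29,11:52:55,1""")
df = pd.read_csv(csv)
df["Data2"] = pd.to_timedelta(df["Data2"])


def sequential(df, gap=pd.Timedelta(seconds=18)):
    """Row-by-row scan, in the spirit of Answer 0."""
    kept, ref = [], None
    for label, row in df.iterrows():
        if ref is None or row.Data2 - ref.Data2 > gap or row.Flag != ref.Flag:
            kept.append(label)
            ref = row
    return kept


def filtering(df, gap=pd.Timedelta(seconds=18)):
    """Repeated boolean filtering, in the spirit of Answer 1."""
    kept, tdf = [], df
    while len(tdf) > 0:
        kept.append(tdf.index[0])
        ref_time = tdf.loc[kept[-1], "Data2"]
        ref_flag = tdf.loc[kept[-1], "Flag"]
        far_enough = tdf["Data2"] > ref_time + gap
        new_flag = (tdf["Flag"] != ref_flag) & (tdf["Data2"] > ref_time)
        tdf = tdf[far_enough | new_flag]
    return kept


print(sequential(df) == filtering(df))  # True
```

Both return the labels 0, 1, 4, 6, 8, 9, 10, 13 on this sample. They are not equivalent in general: the filtering version also requires a strictly later `Data2` for a Flag change, so a row with the same timestamp as the reference row but a different Flag is kept by the scan and dropped by the filter.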