Python DataFrame: iterate over rows (comparing values between them) and prepare groups as output

Time: 2019-12-23 19:04:27

Tags: python pandas dataframe

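I have a dataframe like the one below. I want to group the records by url and status and split them into consecutive date ranges. Is there a more efficient way to do this than the following?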

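def transform_to_unique(df):
    # Assumes a default 0..n-1 index and that rows of the same url are adjacent.
    test = []
    counter = 0

    # First row: compared against the second row, since it has no predecessor.
    if df.loc[0, 'status'] != df.loc[1, 'status']:
        counter = counter + 1
    test.append(counter)

    for i in range(1, len(df)):
        # Restart the counter whenever a new url begins.
        if df.loc[i-1, 'url'] != df.loc[i, 'url']:
            counter = 0

        # A change of status starts a new run.
        if df.loc[i-1, 'status'] != df.loc[i, 'status']:
            counter = counter + 1
        test.append(counter)

    df['test'] = pd.Series(test)

    return df

df = transform_to_unique(frame)

# String names in a list give a fixed column order, unlike the set {min, max}.
df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['min', 'max'])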

[Image: output from script]

Here is the dataframe:

1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active

import pandas as pd

# Copy the rows above to the clipboard before running this.
frame = pd.read_clipboard(sep=",", header=None)
frame.columns = ['url', 'date_scraped', 'status']
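
Since `read_clipboard` depends on whatever happens to be in the clipboard, a reproducible alternative is sketched here. It is only an illustration: the name `RAW` is made up for this example, and the rows are the sample data from the question pasted into a string.

```python
import io

import pandas as pd

# The sample rows from the question, embedded as text (RAW is a hypothetical name).
RAW = """\
1000,20191109,active
1000,20191108,inactive
2000,20191109,active
2000,20191101,inactive
351,20191109,active
351,20191102,active
351,20191026,active
351,20191019,active
351,20191012,active
351,20191005,active
351,20190928,inactive
351,20190921,inactive
351,20190914,inactive
351,20190907,active
351,20190831,active
351,20190615,inactive
3000,20200101,active
"""

# Parse the text exactly as a headerless CSV file would be parsed.
frame = pd.read_csv(
    io.StringIO(RAW),
    header=None,
    names=["url", "date_scraped", "status"],
)
print(frame.shape)  # (17, 3)
```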

1 answer:

Answer 0 (score: 1)

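I'm not sure whether the test column really reflects what your title describes, but is this what you are trying to achieve (based on the sample data you posted)?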

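import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# Next scrape date within the same url; 0 marks the latest record of each url.
# The fill value is 0 rather than np.nan because NaN cannot be cast to an integer reliably.
df["date_scraped_till"] = np.where(df["url"] == df["url"].shift(-1),
                                   df["date_scraped"].shift(-1), 0).astype(np.int32)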

Output:

     url  date_scraped    status  date_scraped_till
15   351      20190615  inactive           20190831
14   351      20190831    active           20190907
13   351      20190907    active           20190914
12   351      20190914  inactive           20190921
11   351      20190921  inactive           20190928
10   351      20190928  inactive           20191005
9    351      20191005    active           20191012
8    351      20191012    active           20191019
7    351      20191019    active           20191026
6    351      20191026    active           20191102
5    351      20191102    active           20191109
4    351      20191109    active                  0
1   1000      20191108  inactive           20191109
0   1000      20191109    active                  0
3   2000      20191101  inactive           20191109
2   2000      20191109    active                  0
16  3000      20200101    active                  0
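
The same "next date within the same url" column can also be written with a grouped shift, which avoids comparing `url` against its own shifted copy. This is a sketch rather than the answer's method: it uses a small hand-typed subset of the sample data and pandas' nullable `Int64` dtype, so the latest record of each url shows a missing value instead of a placeholder number.

```python
import pandas as pd

# A hand-typed subset of the sample data, for illustration only.
df = pd.DataFrame({
    "url": [1000, 1000, 2000, 2000, 3000],
    "date_scraped": [20191109, 20191108, 20191109, 20191101, 20200101],
    "status": ["active", "inactive", "active", "inactive", "active"],
})

df = df.sort_values(["url", "date_scraped"])

# shift(-1) inside each url group never leaks a date from the next url.
# The nullable Int64 dtype keeps the dates as integers and shows <NA> where no next record exists.
df["date_scraped_till"] = (
    df.groupby("url")["date_scraped"].shift(-1).astype("Int64")
)
print(df)
```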

Edit

If you meant "collapse" rather than "split", this should do the trick (essentially, it is a more efficient way of computing the test column):

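import numpy as np

df.sort_values(["url", "date_scraped"], axis=0, ascending=True, inplace=True)

# 1 marks the first row of a run (url or status differs from the previous row), 0 a continuation.
df["test"] = np.where((df["url"] == df["url"].shift(1)) & (df["status"] == df["status"].shift(1)), 0, 1)

# The running total of those flags per (url, status) numbers the runs 1, 2, ...
# This replaces replace(to_replace=0, method='ffill'), which is deprecated in recent pandas versions.
df["test"] = df.groupby(["url", "status"])["test"].cumsum()

# String names in a list fix the column order so that it matches the output below.
df_g = df.groupby(['url', 'status', 'test'])['date_scraped'].agg(['max', 'min'])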

Output:

                    max       min
url  status   test
351  active   1     20190907  20190831
              2     20191109  20191005
     inactive 1     20190615  20190615
              2     20190928  20190914
1000 active   1     20191109  20191109
     inactive 1     20191108  20191108
2000 active   1     20191109  20191109
     inactive 1     20191101  20191101
3000 active   1     20200101  20200101
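
For comparison, here is a self-contained sketch of a common variant of this run-labelling idea. It is not the code from the answer: instead of a per-status counter it assigns one global run id (the hypothetical column `run`) and uses named aggregation, so the spans come out in chronological order within each url. The column names `first_seen` and `last_seen` are made up for this example, and the data is a hand-typed subset of the sample (urls 351 and 1000).

```python
import pandas as pd

# Rows for urls 351 and 1000 from the sample data, typed in for illustration.
df = pd.DataFrame({
    "url": [351] * 12 + [1000, 1000],
    "date_scraped": [
        20191109, 20191102, 20191026, 20191019, 20191012, 20191005,
        20190928, 20190921, 20190914, 20190907, 20190831, 20190615,
        20191109, 20191108,
    ],
    "status": ["active"] * 6 + ["inactive"] * 3 + ["active"] * 2
              + ["inactive"] + ["active", "inactive"],
})

df = df.sort_values(["url", "date_scraped"]).reset_index(drop=True)

# A new run starts whenever the url or the status differs from the previous row.
new_run = (df["url"] != df["url"].shift()) | (df["status"] != df["status"].shift())

# The cumulative sum of the boolean flags gives every run its own id.
df["run"] = new_run.cumsum()

# One output row per run, with the first and last scrape date of that run.
spans = (
    df.groupby(["url", "run", "status"], as_index=False)
      .agg(first_seen=("date_scraped", "min"), last_seen=("date_scraped", "max"))
      .drop(columns="run")
)
print(spans)
```

The date ranges are the same as in the output above; only the labelling differs, since each run is identified by its position in time rather than by a counter per status.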