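Conditionally create (populate) a column that requires processing rows of the DataFrame to match a criterion

Time: 2018-04-06 05:48:18

Tags: python pandas dataframe

I have a DataFrame with several datetime columns and some other categorical/continuous columns. For ease of description, I have uploaded a snippet of the DataFrame and removed the actual date values to avoid clutter.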

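I am trying to create a column whose value can only be decided after the rows of the DataFrame have been checked against a matching criterion.

In this case:

If a row's SECTOR and BASE values match those of some other rows, and the END date of that row equals the START date of a row with the same SECTOR and BASE that appears later in the DataFrame, then populate with 1, otherwise 0. So, basically, I am looking at something like this: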

 BASE     SECTOR     START    END     CHECK
 S     DHHJJ    12/2/2018   13/3/2018   0
 B       DJH    12/3/2018   13/3/2018   0
 S      FHJDFJK 12/4/2018   13/3/2020   0
 B     FHJDG    12/5/2018   13/3/2021   0
 T       XYZ    23/03/2018  25/03/2018  1
 T      ABCD    12/1/2017   13/2/2017   0
 T      ABCD    1/2/2018    1/3/2018    1
 T      ABCD    1/3/2018    15/3/2018   1
 T       XYZ    12/1/2015   12/2/2015   0
 B       XYZ    15/5/2017   15/7/2017   1
 T       XYZ    12/2/2014   12/3/2014   0
 B       XYZ    15/7/2017   20/7/2017   0
 T     SFJUTEUI 12/2/2018   13/3/2018   0
 T      RUTI    12/3/2018   13/3/2019   0
 T      FDJTK   12/4/2018   13/3/2020   0
 B    FJURTUI   12/5/2018   13/3/2021   0
 T    RYURTI    12/6/2018   13/3/2022   0
 T     SFJUI    12/7/2018   13/3/2023   0
 T       XYZ    25/03/2018  30/03/2018  0
 T       XYZ    12/4/2018   12/4/2018   0
 T       XYZ    1/4/2016    1/5/2016    1
 T       XYZ    1/5/2016    5/5/2016    0
 T      ABCD    15/3/2018   31/3/2018   0

Added data, with an exclusive correction to the BASE condition:

BASE    SECTOR  START       END       CHECK
   S    DHHJJ   12/2/2018   13/3/2018   0
   B    DJH    12/3/2018    13/3/2018   0
   S    FHJDFJK 12/4/2018   13/3/2020   0
   B    FHJDG   12/5/2018   13/3/2021   0
   T    XYZ 23/03/2018  25/03/2018  1
   T    ABCD    12/1/2017   13/2/2017   0
   B    ABCD    1/2/2018    1/3/2018    1
   T    ABCD    1/3/2018    15/3/2018   1
   T    XYZ    12/1/2015    12/2/2015   0
   B    XYZ    15/5/2017    15/7/2017   1
   T    XYZ    12/2/2014    12/3/2014   0
   T    XYZ    15/7/2017    20/7/2017   0
   T    SFJUTEUI    12/2/2018   13/3/2018   0
   T    RUTI    12/3/2018   13/3/2019   0
   T    FDJTK   12/4/2018   13/3/2020   0
   B    FJURTUI 12/5/2018   13/3/2021   0
   T    RYURTI  12/6/2018   13/3/2022   0
   T    SFJUI   12/7/2018   13/3/2023   0
   T    XYZ   25/03/2018    30/03/2018  0
   T    XYZ    12/4/2018    12/4/2018   0
   T    XYZ     1/4/2016    1/5/2016    1
   B    XYZ     1/5/2016    5/5/2016    0
   B    ABCD    15/3/2018   31/3/2018   0

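2 answers:

Answer 0 (score: 1)

Use a custom function with groupby to check membership, excluding rows whose START and END dates are the same. To get 0/1 values, cast the boolean result to integer.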

df[['START','END']] = df[['START','END']].apply(pd.to_datetime)

def f(x):
    #test all start datetimes, order is not important
    x['Check1'] = (x['END'].isin(x['START']) & (x['END'] != x['START'])).astype(int)
    return x

df = df.groupby(['BASE','SECTOR']).apply(f)
print (df)
   BASE    SECTOR      START        END  CHECK  Check1
0     S     DHHJJ 2018-12-02 2018-03-13      0       0
1     B       DJH 2018-12-03 2018-03-13      0       0
2     S   FHJDFJK 2018-12-04 2020-03-13      0       0
3     B     FHJDG 2018-12-05 2021-03-13      0       0
4     T       XYZ 2018-03-23 2018-03-25      1       1
5     T      ABCD 2017-12-01 2017-02-13      0       0
6     T      ABCD 2018-01-02 2018-01-03      1       1
7     T      ABCD 2018-01-03 2018-03-15      1       1
8     T       XYZ 2015-12-01 2015-12-02      0       0
9     B       XYZ 2017-05-15 2017-07-15      1       1
10    T       XYZ 2014-12-02 2014-12-03      0       0
11    B       XYZ 2017-07-15 2017-07-20      0       0
12    T  SFJUTEUI 2018-12-02 2018-03-13      0       0
13    T      RUTI 2018-12-03 2019-03-13      0       0
14    T     FDJTK 2018-12-04 2020-03-13      0       0
15    B   FJURTUI 2018-12-05 2021-03-13      0       0
16    T    RYURTI 2018-12-06 2022-03-13      0       0
17    T     SFJUI 2018-12-07 2023-03-13      0       0
18    T       XYZ 2018-03-25 2018-03-30      0       0
19    T       XYZ 2018-12-04 2018-12-04      0       0
20    T       XYZ 2016-01-04 2016-01-05      1       1
21    T       XYZ 2016-01-05 2016-05-05      0       0
22    T      ABCD 2018-03-15 2018-03-31      0       0
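
Note that the printed frame above shows day and month swapped for several rows (for example, 12/2/2018 appears as 2018-12-02), because the strings were parsed without a day-first hint. The following is a minimal sketch of the same membership check, assuming the dates are day/month/year as the question's data suggests. The four-row sample and the helper name `flag_group` are made up for illustration, and the explicit loop over groups is only one way to write the result back onto the original index.

```python
import pandas as pd

# Hypothetical sample in the question's day/month/year layout
df = pd.DataFrame({
    "BASE":   ["T", "T", "T", "B"],
    "SECTOR": ["ABCD", "ABCD", "ABCD", "ABCD"],
    "START":  ["12/1/2017", "1/2/2018", "1/3/2018", "1/3/2018"],
    "END":    ["13/2/2017", "1/3/2018", "15/3/2018", "1/3/2018"],
})

# Parse with an explicit day-first format (assumption about the data)
for col in ["START", "END"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")


def flag_group(g):
    # 1 where the row's END equals some START in the same group,
    # ignoring rows whose own START and END coincide
    return (g["END"].isin(g["START"]) & (g["END"] != g["START"])).astype(int)


df["Check1"] = 0
for _, g in df.groupby(["BASE", "SECTOR"]):
    # g.index holds the original row labels, so the flags align on assignment
    df.loc[g.index, "Check1"] = flag_group(g)

print(df)
```

Here only the second row is flagged: its END (1 March 2018) is the START of the third row in the same BASE/SECTOR group, while the single B row is skipped because its START and END are identical.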

If the ordering of the datetimes matters for the membership check:

def f1(x):
    e = x['END']
    s = x['START']
    #for each END datetime, test all START datetimes in the following rows
    m = {j[0]: (s.iloc[i+1:] == j[1]).any() for i,j in enumerate(e.items())}
    x['Check2'] = pd.Series(m).astype(int)
    return x

df = df.groupby(['BASE','SECTOR']).apply(f1)
print (df)

To see the difference more clearly, one value was changed:

print (df.tail())
   BASE SECTOR       START         END  CHECK
18    T    XYZ  25/03/2018  30/03/2018      0
19    T    XYZ    5/5/2016   12/4/2018      0 <-changed value to 5/5/2016
20    T    XYZ    1/4/2016    1/5/2016      1
21    T    XYZ    1/5/2016    5/5/2016      0
22    T   ABCD   15/3/2018   31/3/2018      0


df = df.groupby(['BASE','SECTOR']).apply(f)
df = df.groupby(['BASE','SECTOR']).apply(f1)
print (df.tail())
   BASE SECTOR      START        END  CHECK  Check1  Check2
18    T    XYZ 2018-03-25 2018-03-30      0       0       0
19    T    XYZ 2016-05-05 2018-12-04      0       0       0
20    T    XYZ 2016-01-04 2016-01-05      1       1       1
21    T    XYZ 2016-01-05 2016-05-05      0       1       0
22    T   ABCD 2018-03-15 2018-03-31      0       0       0
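
The difference between the two checks can also be reproduced on a small frame. This is a sketch under assumptions: the three rows mimic the T/XYZ rows above with day/month/year dates, and `later_start_matches` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Three rows modelled on the T/XYZ rows above (day/month/year assumed)
df = pd.DataFrame({
    "BASE":   ["T", "T", "T"],
    "SECTOR": ["XYZ", "XYZ", "XYZ"],
    "START":  pd.to_datetime(["5/5/2016", "1/4/2016", "1/5/2016"], format="%d/%m/%Y"),
    "END":    pd.to_datetime(["12/4/2018", "1/5/2016", "5/5/2016"], format="%d/%m/%Y"),
})


def later_start_matches(g):
    # For each row, 1 if its END equals the START of any row further down the group
    starts = g["START"].tolist()
    ends = g["END"].tolist()
    flags = [int(end in starts[i + 1:]) for i, end in enumerate(ends)]
    return pd.Series(flags, index=g.index)


df["Check1"] = 0
df["Check2"] = 0
for _, g in df.groupby(["BASE", "SECTOR"]):
    # Order-insensitive: END may match a START anywhere in the group
    df.loc[g.index, "Check1"] = (g["END"].isin(g["START"]) & (g["END"] != g["START"])).astype(int)
    # Order-sensitive: END may only match a START in a later row
    df.loc[g.index, "Check2"] = later_start_matches(g)

print(df)
```

The last row ends on 5 May 2016, which is the START of the first row. The order-insensitive check flags it, while the order-sensitive one does not, because the matching row comes earlier in the frame.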

Answer 1 (score: 1)

Thanks @Jezrael. To summarize, here is the solution:

import subprocess
import sys

add: str = sys.argv[1]
commit: str = sys.argv[2]
branch: str = sys.argv[3]


def run_command(args: list):
    # Run one git command and echo whatever it writes.
    # args is a list so that arguments containing spaces
    # (such as the commit message) are passed intact.
    print(" ".join(args))
    process = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = process.communicate()
    output = output.decode()
    error = error.decode()
    if output:
        print("output: " + output)
    if error:
        print("error: " + error)


def main():
    global add
    global branch
    # Fall back to defaults when an argument is blank
    if add == "" or add == " ":
        add = "."
    if branch == "":
        branch = "master"
    print("add: '" + add + "' commit: '" + commit + "' branch: '" + branch + "'")

    # Several whitespace-separated paths may be given in `add`
    run_command(["git", "add"] + add.split())

    # The message is a single list element, so no quoting tricks are needed
    run_command(["git", "commit", "-m", commit])

    run_command(["git", "push", "origin", branch])


if __name__ == '__main__':
    main()