I'm trying to group a DataFrame based on some conditions.
DataFrame:
Start Date End Date value
1971-07-01 1971-07-31 0.0
1971-08-01 1971-08-31 0.25
1971-09-01 1971-09-30 -0.62
1971-10-01 1971-10-31 0.0
1971-11-01 1971-11-30 -0.63
1971-12-01 1971-12-31 -1.0
1972-01-01 1972-01-31 0.0
1972-02-01 1972-02-29 0.0
1972-03-01 1972-03-31 2.0
1972-04-01 1972-04-30 0.0
.
.
1973-07-01 1973-07-31 2.0
1973-08-01 1973-08-31 0.5
1973-09-01 1973-09-30 -2.0
1973-10-01 1973-10-31 0.0
1973-11-01 1973-11-30 0.0
1973-12-01 1973-12-31 0.0
1974-01-01 1974-01-31 0.0
1974-02-01 1974-02-28 0.0
.
.
.
1974-11-01 1974-11-30 0.0
1974-12-01 1974-12-31 -1.25
1975-01-01 1975-01-31 -1.0
1975-02-01 1975-02-28 -1.0
1975-03-01 1975-03-31 -0.5
1975-04-01 1975-04-30 -0.25
1975-05-01 1975-05-31 0.0
1975-06-01 1975-06-30 1.25
1975-07-01 1975-07-31 0.0
1975-08-01 1975-08-31 0.0
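Grouping conditions:
A group must always start with a negative value.
The group continues as long as we keep seeing negative values.
The group ends when we hit a positive value or three consecutive zeros.
Example 1 from the DataFrame above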
1971-09-01 1971-09-30 -0.62
1971-10-01 1971-10-31 0.0
1971-11-01 1971-11-30 -0.63
1971-12-01 1971-12-31 -1.0
1972-01-01 1972-01-31 0.0
1972-02-01 1972-02-29 0.0
Example 2 (in this case we hit three consecutive zeros)
1973-09-01 1973-09-30 -2.0
1973-10-01 1973-10-31 0.0
1973-11-01 1973-11-30 0.0
1973-12-01 1973-12-31 0.0
Example 3 (in this case we hit a positive value)
1974-12-01 1974-12-31 -1.25
1975-01-01 1975-01-31 -1.0
1975-02-01 1975-02-28 -1.0
1975-03-01 1975-03-31 -0.5
1975-04-01 1975-04-30 -0.25
1975-05-01 1975-05-31 0.0
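I don't have any working code, because I'm still figuring out how to put these conditions into a groupby, or any other efficient way to do this.
I tried a loop, but I'm not getting anywhere.
for i in df.index:
    no = 0
    if df['value'][i] < 0:
        df['groupno'] = no
After grouping, I want to get the Start Date of the first row of each group and the End Date of its last row.
Expected result (from the examples):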
Start Date End Date
1971-09-01 1972-02-29
1973-09-01 1973-12-31
1974-12-01 1975-05-31
Thanks for reading.
Answer 0: (score: 0)
I don't think this is the Pythonic way, but it works, and I think it will help you.
import pandas as pd

groups = []
start = ''  # start date of the current group ('' means no group is open)
end = ''    # end date of the previous row
nulls = 0   # count of consecutive zeros
for _, row in df.iterrows():
    # the first negative value starts a group
    if row['value'] < 0 and start == '':
        start = row['Start Date']
        nulls = 0
    # count consecutive zeros; any non-zero value resets the count
    if row['value'] == 0:
        nulls += 1
    else:
        nulls = 0
    # a positive value or three zeros ends the group (if one was started)
    if ((row['value'] > 0) or (nulls == 3)) and start != '':
        # after three zeros the group ends on this row, not on the previous one
        if nulls == 3:
            end = row['End Date']
        groups.append((start, end))
        start = ''
        nulls = 0
    # three zeros outside a group: just reset the counter
    if nulls == 3:
        start = ''
        nulls = 0
    # remember this row's end date for the next iteration
    end = row['End Date']
result = pd.DataFrame(groups, columns=['Start Date', 'End Date'])
print(result)
It isn't a group by, but it can help you find the start and end dates of the groups.
Out:
Start Date End Date
0 1971-09-01 1972-02-29
1 1973-09-01 1973-12-31
2 1974-12-01 1975-05-31
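If an actual groupby is wanted, one possible variation is to label each row with a group number first and aggregate afterwards. The sketch below is only an illustration under a few assumptions: `label_groups` and `segments` are hypothetical names, `groupno` is borrowed from the loop attempt in the question, only the rows visible in the question are rebuilt (the elided rows are left out), and a group that is still open when the data ends is kept, which differs from the loop above.

```python
import pandas as pd


def label_groups(values):
    """Return one label per value: a group number, or None outside any group."""
    labels = []
    group_no = 0      # number of groups opened so far
    in_group = False  # True while a group is open
    zeros = 0         # consecutive zeros seen inside the open group
    for v in values:
        if not in_group:
            if v < 0:                  # a negative value opens a new group
                group_no += 1
                in_group, zeros = True, 0
                labels.append(group_no)
            else:
                labels.append(None)
        elif v > 0:                    # a positive value closes the group and is excluded
            in_group = False
            labels.append(None)
        else:
            zeros = zeros + 1 if v == 0 else 0
            labels.append(group_no)
            if zeros == 3:             # the third consecutive zero closes the group and is included
                in_group = False
    return labels


# Rebuild only the rows shown in the question (the elided rows are omitted).
segments = [
    ("1971-07-01", [0.0, 0.25, -0.62, 0.0, -0.63, -1.0, 0.0, 0.0, 2.0, 0.0]),
    ("1973-07-01", [2.0, 0.5, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0]),
    ("1974-11-01", [0.0, -1.25, -1.0, -1.0, -0.5, -0.25, 0.0, 1.25, 0.0, 0.0]),
]
frames = []
for first_month, values in segments:
    starts = pd.date_range(first_month, periods=len(values), freq="MS")
    frames.append(pd.DataFrame({
        "Start Date": starts,
        "End Date": starts + pd.offsets.MonthEnd(0),  # roll each month start to its month end
        "value": values,
    }))
df = pd.concat(frames, ignore_index=True)

df["groupno"] = label_groups(df["value"])

# groupby skips rows whose key is missing, so unlabeled rows drop out here.
summary = (
    df.groupby("groupno")
      .agg({"Start Date": "first", "End Date": "last"})
      .reset_index(drop=True)
)
print(summary)
```

On the rows shown in the question this gives the same three date ranges as the output above. Keeping the labels in a `groupno` column also makes other per-group aggregations (for example a sum of `value`) possible with the same groupby.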