I have a DataFrame with a datetime index. First, here is my dummy data.
import pandas as pd

data1 = {
    'date': ['20190219 093100', '20190219 103200', '20190219 171200',
             '20190219 193900', '20190219 194500', '20190220 093500',
             '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95],
}
df1 = pd.DataFrame(data1)
df1 = df1.set_index('date')
# Parse the index into a real DatetimeIndex; calling .strftime() here would
# turn it back into plain strings.
df1.index = pd.to_datetime(df1.index, format='%Y%m%d %H%M%S')
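What I want to do is create a new column called "New_column" holding a categorical variable, "Yes" or "No", depending on whether the value in the "number" column increases by at least 20% later that same day.
So in this dummy data, only the second value, "12.85", would be "Yes", because by the timestamp "2019-02-19 19:45:00" it has risen by 23.35%.
Even though the first value is 25% greater than the third, that should not count, because the comparison must only look forward in time: a row qualifies only if a later value on the same day is at least 20% higher.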
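The percentages quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the helper name `pct_increase` is hypothetical and not part of the original post.

```python
def pct_increase(earlier, later):
    """Percentage change from an earlier value to a later one."""
    return (later - earlier) / earlier * 100

# 12.85 at 10:32 versus 15.85 at 19:45 on the same day: a forward-looking rise.
rise = pct_increase(12.85, 15.85)
print(round(rise, 2))  # 23.35

# 14.89 at 17:12 versus 18.6125 at 09:31: the larger value comes earlier,
# so this 25% gap must not be counted.
backward = pct_increase(14.89, 18.6125)
print(round(backward, 2))  # 25.0
```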
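Once this is done, the last row of each day should have NaN in "New_column".
I have tried many different approaches using:
- pandas.DataFrame.pct_change
- pandas.DataFrame.diff
If anyone has an idea for doing this in a Pythonic way, please help me out.
Thanks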
Answer 0 (score: 1)
Initial setup
import numpy as np
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95],
}
df = pd.DataFrame(data)
# Parse with an explicit format; astype('datetime64') without a unit is
# rejected by recent pandas versions.
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.sort_values('datetime')
df['date'] = df['datetime'].dt.date
df['New_column'] = 'No'
Find all rows whose value rises by at least 20% later the same day
indices_true = set()
# Compare every row with every other row (quadratic, fine for small data)
for idx_low, row_low in df.iterrows():
    for idx_high, row_high in df.iterrows():
        if (row_low['date'] == row_high['date'] and
                row_low['datetime'] < row_high['datetime'] and
                row_low['number'] * 1.2 <= row_high['number']):
            indices_true.add(idx_low)

# Assign 'Yes' for the true rows
for i in indices_true:
    df.loc[i, 'New_column'] = 'Yes'

# Last timestamp every day assigned as NaN (requires numpy imported as np)
df.loc[df['date'] != df['date'].shift(-1), 'New_column'] = np.nan

# Optionally convert to categorical variable
df['New_column'] = pd.Categorical(df['New_column'])
Output
>>> df
             datetime   number        date New_column
0 2019-02-19 09:31:00  18.6125  2019-02-19         No
1 2019-02-19 10:32:00  12.8500  2019-02-19        Yes
2 2019-02-19 17:12:00  14.8900  2019-02-19         No
3 2019-02-19 19:39:00  15.8301  2019-02-19         No
4 2019-02-19 19:45:00  15.8500  2019-02-19        NaN
5 2019-02-20 09:35:00  14.9160  2019-02-20         No
6 2019-02-20 09:36:00  14.9500  2019-02-20        NaN
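For larger frames, the nested row-by-row comparison can get slow. One possible vectorized alternative, sketched here under the assumption that "increased by at least 20%" means "some later value on the same day is at least 1.2 times the current one", is to compute the maximum of all later values per day with a reversed running maximum. The names `later_max` and `future_max` are hypothetical and introduced only for this illustration.

```python
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95],
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.sort_values('datetime').reset_index(drop=True)
df['date'] = df['datetime'].dt.date

def later_max(s):
    """Largest value strictly after each position within one day's Series."""
    # Reverse, take the running max, reverse back, then shift up by one so
    # the current row itself is excluded; the last row of the day gets NaN.
    return s[::-1].cummax()[::-1].shift(-1)

future_max = df.groupby('date')['number'].transform(later_max)

# 'Yes' where some later same-day value is at least 20% higher, else 'No'.
flag = (df['number'] * 1.2 <= future_max).map({True: 'Yes', False: 'No'})
# The last row of each day has no later value, so it becomes NaN.
df['New_column'] = flag.where(future_max.notna())

print(df)
```

On the sample data this should mark only the 12.85 row as "Yes" and leave the last row of each day as NaN, matching the loop-based result above.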