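Is there a way to create new columns that flag each month covered by the interval between two datetimes? The output could be a binary value in each new month column. I was thinking of something like this (it doesn't work):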
for i in [1, 2, 3, 4, 5]:
    i_name = str(i)
    values = example['end'] - example['start']  # Example line - need to expose values here
    example[i_name] = values
From this:
          end         name       start
0  28/02/2012   joe bloggs  01/01/2012
1  15/03/2012  jane bloggs  01/02/2012
2  17/05/2012   jim bloggs  01/04/2012
3  18/04/2012  john bloggs  01/02/2012
To this:
          end  1  2  3  4  5         name       start
0  28/02/2012  1  1  0  0  0   joe bloggs  01/01/2012
1  15/03/2012  0  1  1  0  0  jane bloggs  01/02/2012
2  17/05/2012  0  0  0  1  1   jim bloggs  01/04/2012
3  18/04/2012  0  1  1  1  0  john bloggs  01/02/2012
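To make the examples reproducible, the input frame can be rebuilt roughly like this. It is a minimal sketch: the values are copied from the table above, and the dates are deliberately left as day-first strings, the way they appear there.

```python
import pandas as pd

# Sample data copied from the table above; dates are still day-first strings.
example = pd.DataFrame({
    "end": ["28/02/2012", "15/03/2012", "17/05/2012", "18/04/2012"],
    "name": ["joe bloggs", "jane bloggs", "jim bloggs", "john bloggs"],
    "start": ["01/01/2012", "01/02/2012", "01/04/2012", "01/02/2012"],
})
print(example)
```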
Answer 0 (score: 3)
I think you can mainly use get_dummies with stack:
import pandas as pd

#convert columns to datetime
df['end'] = pd.to_datetime(df['end'], dayfirst=True)
df['start'] = pd.to_datetime(df['start'], dayfirst=True)
#get months to Series
end = df['end'].dt.month
start = df['start'].dt.month
#create difference DataFrame: one row per record, one column per covered month
df1 = (pd.DataFrame({'end': end, 'start': start})
         .apply(lambda x: pd.Series(range(x['start'], x['end'] + 1)), axis=1))
print(df1)
0 1 2
0 1.0 2.0 NaN
1 2.0 3.0 NaN
2 4.0 5.0 NaN
3 2.0 3.0 4.0
#create indicator variables, sum values by index
df1 = (pd.get_dummies(df1.stack().reset_index(level=1, drop=True))
         .groupby(level=0).sum().astype(int))
#convert float column names to int
df1.columns = df1.columns.astype(int)
print(df1)
1 2 3 4 5
0 1 1 0 0 0
1 0 1 1 0 0
2 0 0 0 1 1
3 0 1 1 1 0
#append to original DataFrame
print(pd.concat([df, df1], axis=1))
         end         name      start  1  2  3  4  5
0 2012-02-28   joe bloggs 2012-01-01  1  1  0  0  0
1 2012-03-15  jane bloggs 2012-02-01  0  1  1  0  0
2 2012-05-17   jim bloggs 2012-04-01  0  0  0  1  1
3 2012-04-18  john bloggs 2012-02-01  0  1  1  1  0
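The same get_dummies idea can probably be written without the NaN-padded intermediate frame, which also avoids the float column names. The sketch below is a variant, not the code from the answer above: it assumes a pandas version that has Series.explode (0.25 or later), that start and end fall in the same year, and the name month_lists is only illustrative.

```python
import pandas as pd

# Input rebuilt from the question's table, already parsed as datetimes.
df = pd.DataFrame({
    "end": pd.to_datetime(["28/02/2012", "15/03/2012", "17/05/2012", "18/04/2012"], dayfirst=True),
    "name": ["joe bloggs", "jane bloggs", "jim bloggs", "john bloggs"],
    "start": pd.to_datetime(["01/01/2012", "01/02/2012", "01/04/2012", "01/02/2012"], dayfirst=True),
})

# One list of covered month numbers per row (assumes start and end share a year).
month_lists = pd.Series(
    [list(range(s, e + 1)) for s, e in zip(df["start"].dt.month, df["end"].dt.month)],
    index=df.index,
)

# explode gives one row per (record, month); get_dummies turns the month
# numbers into indicator columns; groupby collapses back to one row per record.
long_months = month_lists.explode().astype(int)
df1 = pd.get_dummies(long_months).groupby(level=0).sum().astype(int)

print(pd.concat([df, df1], axis=1))
```

Because the month numbers stay integers throughout, the column labels come out as 1 to 5 directly.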
Answer 1 (score: 2)
This works:
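example = pd.read_csv(FILE, parse_dates=[0, 2], dayfirst=True)
for i in [1, 2, 3, 4, 5]:
    i_name = str(i)
    # flag month i when its first day falls between start and end (inclusive);
    # pd.Timestamp replaces pd.datetime, which was removed in pandas 2.0
    month_start = pd.Timestamp(2012, i, 1)
    example[i_name] = example.apply(lambda row: row["start"] <= month_start <= row["end"], axis=1).astype(int)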
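Note that this check only looks at the first day of each month, so a start date in the middle of a month would leave that month unflagged. If that matters, one possible variant is an interval-overlap test against the whole month, done column-wise instead of with apply. This is a sketch under assumptions: the year 2012 is hard-coded as in the answer, and month_start, next_month_start and overlaps are illustrative names.

```python
import pandas as pd

# Input rebuilt from the question's table, already parsed as datetimes.
example = pd.DataFrame({
    "end": pd.to_datetime(["28/02/2012", "15/03/2012", "17/05/2012", "18/04/2012"], dayfirst=True),
    "name": ["joe bloggs", "jane bloggs", "jim bloggs", "john bloggs"],
    "start": pd.to_datetime(["01/01/2012", "01/02/2012", "01/04/2012", "01/02/2012"], dayfirst=True),
})

for i in range(1, 6):
    month_start = pd.Timestamp(2012, i, 1)
    next_month_start = month_start + pd.DateOffset(months=1)
    # The interval [start, end] overlaps month i when it begins before the
    # next month starts and ends on or after the first day of month i.
    overlaps = (example["start"] < next_month_start) & (example["end"] >= month_start)
    example[str(i)] = overlaps.astype(int)

print(example)
```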
Answer 2 (score: 1)
First, you have to convert the date columns to datetime with pd.to_datetime:
import pandas as pd
example['end'] = pd.to_datetime(example['end'], dayfirst=True)  # the default (dayfirst=False) would read 01/02/2012 as January 2
example['start'] = pd.to_datetime(example['start'], dayfirst=True)
Then, inside your for loop, you can set the appropriate values:
example[str(i)] = 0
# use .loc instead of chained indexing so the assignment reliably hits the frame
example.loc[(example['start'].dt.month <= i) & (i <= example['end'].dt.month), str(i)] = 1
(borrowing dt.month from jezrael's answer), which gives:
import pandas as pd
example['end'] = pd.to_datetime(example['end'], dayfirst=True)  # the default (dayfirst=False) would read 01/02/2012 as January 2
example['start'] = pd.to_datetime(example['start'], dayfirst=True)
for i in range(1, 13):
    example[str(i)] = 0
    # use .loc instead of chained indexing so the assignment reliably hits the frame
    example.loc[(example['start'].dt.month <= i) & (i <= example['end'].dt.month), str(i)] = 1
Which then results in:
In[101]: example
Out[101]:
         end         name      start  1  2  3  4  5  6  7  8  9  10  11  12
0 2012-02-28   joe bloggs 2012-01-01  1  1  0  0  0  0  0  0  0   0   0   0
1 2012-03-15  jane bloggs 2012-02-01  0  1  1  0  0  0  0  0  0   0   0   0
2 2012-05-17   jim bloggs 2012-04-01  0  0  0  1  1  0  0  0  0   0   0   0
3 2012-04-18  john bloggs 2012-02-01  0  1  1  1  0  0  0  0  0   0   0   0
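If the loop over twelve columns feels clumsy, the same month comparison can likely be done in one step with NumPy broadcasting. This is only a sketch: it assumes start and end fall in the same year (as the loop does), and the names months, flags and result are illustrative.

```python
import numpy as np
import pandas as pd

# Input rebuilt from the question's table, already parsed as datetimes.
example = pd.DataFrame({
    "end": pd.to_datetime(["28/02/2012", "15/03/2012", "17/05/2012", "18/04/2012"], dayfirst=True),
    "name": ["joe bloggs", "jane bloggs", "jim bloggs", "john bloggs"],
    "start": pd.to_datetime(["01/01/2012", "01/02/2012", "01/04/2012", "01/02/2012"], dayfirst=True),
})

months = np.arange(1, 13)  # shape (12,)
start_month = example["start"].dt.month.to_numpy()[:, None]  # shape (n_rows, 1)
end_month = example["end"].dt.month.to_numpy()[:, None]      # shape (n_rows, 1)

# Broadcasting compares every row against every month number at once.
flags = (start_month <= months) & (months <= end_month)

result = pd.concat(
    [example, pd.DataFrame(flags.astype(int), index=example.index,
                           columns=[str(m) for m in months])],
    axis=1,
)
print(result)
```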