I'm new to Python and have started using Pandas to replace some processes I used to do in MS Excel.
Here is a description of my problem.
Initial dataframe:
Contract Id, Start date, End date
12378, '01-01-2018', '15-05-2018'
45679, '10-03-2018', '31-07-2018'
567982, '01-01-2018', '31-12-2020'
Expected output:
Contract Id, Start date, End date, Jan-18, Feb-18, Mar-18, Apr-18, May-18, ..., Dec-18
12378, '01-01-2018', '15-05-2018', 1, 1, 1, 1, 1, 0, 0, 0, 0, ..., 0
45679, '10-03-2018', '31-07-2018', 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ..., 0
567982, '01-01-2018', '31-12-2020', 1, 1, 1, 1, ..., 1, 1, 1
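I want to create a set of new columns with the Month Id as the column header, and fill each one with a flag (1 or 0) that indicates whether the contract was active during that month.
Any help would be much appreciated. Thanks.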
Answer 0 (score: 1)
I'm new to pandas too. There may be a better way to do this, but here is what I have:
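import numpy as np
import pandas as pd
# 'S' and 'E' are assumed to be the start-date and end-date columns, holding 'dd-mm-yyyy' strings.
# Extract the month number from each date string.
data['S_month'] = data['S'].apply(lambda x: int(x.split('-')[1]))
data['E_month'] = data['E'].apply(lambda x: int(x.split('-')[1]))
months = []
for s_e in data[['S_month', 'E_month']].values:
    # One flag per calendar month, set to 1 from the start month through the end month.
    # Only month numbers are compared, so the year part of each date is ignored.
    month = np.zeros(12)
    month[s_e[0]-1:s_e[1]] = 1
    months.append(month)
months = pd.DataFrame(months, dtype=int, columns=np.arange(1, 13))
data.join(months)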
Alternatively, you can skip creating the S_month and E_month columns and do the following:
months = []
for s_e in data[['S', 'E']].values:
    month = np.zeros(12)
    # Parse the month numbers straight from the 'dd-mm-yyyy' strings
    month[int(s_e[0].split('-')[1])-1:int(s_e[1].split('-')[1])] = 1
    months.append(month)
months = pd.DataFrame(months, dtype=int, columns=np.arange(1, 13))
data.join(months)
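The slicing above looks only at month numbers, so a contract that starts in one year and ends in another may be flagged incorrectly. A minimal sketch of a year-aware variant follows, assuming the column names from the question and a reporting window of calendar year 2018. The helper month_index is a hypothetical name introduced here for illustration; it is not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

# Sample data and column names taken from the question.
df = pd.DataFrame({
    'Contract Id': [12378, 45679, 567982],
    'Start date': ['01-01-2018', '10-03-2018', '01-01-2018'],
    'End date': ['15-05-2018', '31-07-2018', '31-12-2020'],
})

def month_index(dates):
    """Hypothetical helper: map each date to the integer year * 12 + month."""
    return (dates.dt.year * 12 + dates.dt.month).to_numpy()

start = month_index(pd.to_datetime(df['Start date'], format='%d-%m-%Y'))
end = month_index(pd.to_datetime(df['End date'], format='%d-%m-%Y'))

# Months to report on (assumption: calendar year 2018).
periods = pd.period_range('2018-01', '2018-12', freq='M')
wanted = (periods.year * 12 + periods.month).to_numpy()

# Broadcast contracts (rows) against months (columns) in a single comparison.
flags = ((start[:, None] <= wanted[None, :]) &
         (wanted[None, :] <= end[:, None])).astype(int)

result = df.join(pd.DataFrame(flags, columns=periods.strftime('%b-%y'), index=df.index))
print(result)
```

Because the year is folded into the integer, the same comparison works unchanged if the reporting window is widened to several years.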
Answer 1 (score: 1)
This approach uses the very rich date functionality in pandas, specifically PeriodIndex.
import pandas as pd
import numpy as np
from io import StringIO
# Sample data (simplified)
df1 = pd.read_csv(StringIO("""
'Contract Id','Start date','End date'
12378,'01-02-2018','15-03-2018'
45679,'10-03-2018','31-05-2018'
567982,'01-01-2018','30-06-2018'
"""), quotechar="'", dayfirst=True, parse_dates=[1,2])
# Establish the months as a pandas PeriodIndex with monthly frequency
dates = pd.period_range(df1['Start date'].min(), df1['End date'].max(), freq="M")
# Create a new dataframe of date matches with apply.
# Each start and end date is converted to a monthly Period before comparing,
# since recent pandas versions may reject comparing a Timestamp with a PeriodIndex.
# The two comparisons are ANDed elementwise using numpy logical_and.
data = df1.apply(
    lambda r: pd.Series(
        np.logical_and(dates >= r['Start date'].to_period('M'),
                       dates <= r['End date'].to_period('M')).astype(int)),
    axis=1)
# Data frame with named month columns
df2 = pd.DataFrame(data)
df2.columns = dates
# Concatenate
result = pd.concat([df1, df2], axis=1)
result
# Contract Id Start date End date 2018-01 2018-02 2018-03 2018-04 2018-05 2018-06
#0 12378 2018-02-01 2018-03-15 0 1 1 0 0 0
#1 45679 2018-03-10 2018-05-31 0 0 1 1 1 0
#2 567982 2018-01-01 2018-06-30 1 1 1 1 1 1
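The question asked for headers such as Jan-18, whereas a monthly PeriodIndex renders as 2018-01. As a small sketch, assuming the reporting window is calendar year 2018, the periods can be formatted with PeriodIndex.strftime and the resulting labels assigned as column names instead:

```python
import pandas as pd

# Monthly periods for 2018 (the year used in the question's expected output).
dates = pd.period_range('2018-01', '2018-12', freq='M')

# Format each period as abbreviated month name plus two-digit year.
# %b follows the active locale, so the names are English only under an English or C locale.
labels = dates.strftime('%b-%y')
print(list(labels))
```

The labels are plain strings, so they no longer sort chronologically across years; keeping the PeriodIndex until the final step avoids that problem.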
Answer 2 (score: 0)
Pandas comes with many built-in date/time handling methods, and they can be put to good use here.
# SETUP
# -----
import pandas as pd
# Initialize input dataframe
data = [
[12378, '01-01-2018', '15-05-2018'],
[45679, '10-03-2018', '31-07-2018'],
[567982, '01-01-2018', '31-12-2020'],
]
columns = ['Contract Id', 'Start date', 'End date']
df = pd.DataFrame(data, columns=columns)
# SOLUTION
# --------
# Convert strings to datetime objects
df['Start date'] = pd.to_datetime(df['Start date'], format='%d-%m-%Y')
df['End date'] = pd.to_datetime(df['End date'], format='%d-%m-%Y')
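# Reduce each date to a monthly period (year + month)
start_month = df['Start date'].dt.to_period('M')
end_month = df['End date'].dt.to_period('M')
# For each month in year 2018 ...
for x in pd.date_range('2018-01', '2018-12', freq='MS'):
    month = x.to_period('M')
    # Create a column with contract-active flags. Comparing periods instead of
    # bare month numbers keeps the flags correct for contracts that span years.
    df[x.strftime("%b-%y")] = (start_month <= month) & (end_month >= month)
    # Optional: convert True/False values to 0/1 values
    df[x.strftime("%b-%y")] = df[x.strftime("%b-%y")].astype(int)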
This results in:
In [1]: df
Out[1]:
Contract Id Start date End date Jan-18 Feb-18 Mar-18 Apr-18 May-18 Jun-18 Jul-18 Aug-18 Sep-18 Oct-18 Nov-18 Dec-18
0 12378 2018-01-01 2018-05-15 1 1 1 1 1 0 0 0 0 0 0 0
1 45679 2018-03-10 2018-07-31 0 0 1 1 1 1 1 0 0 0 0 0
2 567982 2018-01-01 2020-12-31 1 1 1 1 1 1 1 1 1 1 1 1