I have the following pandas DataFrame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
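It is part of a larger dataset that I need to be able to filter by month and year. I need to extract the fiscal year from the Shortcut_Dimension_4_Code column: from the first two digits for values greater than 9999999, and from the first digit for values less than or equal to 9999999. That value is then appended to "20", zero-padded where needed, to produce the year: "20" + "8" gives 2008 and "20" + "10" gives 2010.
That year (2008, 2010) then needs to be combined with the Stage_Code value (1-12) to produce a month/year, e.g. 02/2010.
The date 02/2010 then needs to be converted from a fiscal-year date to a calendar-year date, e.g. fiscal-year date 02/2010 = calendar-year date 08/2009. The resulting date needs to appear in a new column. The final df would end up looking like this: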
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
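The conversion described above comes down to month arithmetic. A minimal plain-Python sketch, assuming the fiscal year starts six months before the calendar year (inferred from the examples, where fiscal 02/2010 maps to 08/2009 and fiscal 01/2008 maps to 07/2007); `fiscal_to_calendar` and `offset` are hypothetical names used only for illustration:

```python
def fiscal_to_calendar(fiscal_year, fiscal_month, offset=6):
    """Shift a fiscal (year, month) pair back by `offset` months."""
    # Count months on a single axis so divmod handles the year rollover
    total = fiscal_year * 12 + (fiscal_month - 1) - offset
    year, month_index = divmod(total, 12)
    return year, month_index + 1


for code, stage in [(10225003, 2), (8225003, 1), (8225004, 3), (8225005, 4)]:
    # Integer division keeps the leading one or two digits of the code
    fiscal_year = 2000 + code // 1_000_000
    year, month = fiscal_to_calendar(fiscal_year, stage)
    print(f"{code} {stage} {month:02d}/{year}")
```

This reproduces the four Date values in the table above for the sample rows.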
I am new to pandas and Python and could use some help. This is what I have so far:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used the .map and .str methods to produce this df, but I have not been able to figure out how to get the fiscal year right for fiscal years 2008-2009.
Answer 0 (score: 0)
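In the code below I assume that Shortcut_Dimension_4_Code is an integer. If it is a string, you can either convert it or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. There is more explanation in the comments next to the code.
This approach should work as long as you don't have to handle null values.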
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
fiscal_month_offset = 6
input_df = pd.DataFrame(
[[10225003, 2],
[8225003, 1],
[8225004, 3],
[8225005, 4]],
columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
# make a copy of input dataframe to avoid modifying it
df = input_df.copy()
# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)
# loop with `apply` to create `date` objects from available columns
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)
# by default Python dates are stored with the `object` dtype in pandas. You can verify with `df.info()`
# to use the clever things pandas can do with dates we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])
# I would just keep date as datetime type so I could access year and month
# but to create same representation as in question, let's format it as string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')
# copy important columns into output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
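For a larger dataset, the row-by-row `apply` can be replaced by integer arithmetic on whole columns. The following is only a sketch of that alternative, not the method used above: `add_calendar_date` and `FISCAL_MONTH_OFFSET` are hypothetical names, the six-month offset is the same assumption as in the code above, and it still assumes there are no null values. The `astype('int64')` call is there so that codes stored as digit strings are accepted too.

```python
import pandas as pd

FISCAL_MONTH_OFFSET = 6  # assumption: fiscal year runs 6 months ahead of the calendar year


def add_calendar_date(df, offset=FISCAL_MONTH_OFFSET):
    """Return a copy of `df` with `calendar_date` and `Date` columns added."""
    out = df.copy()
    # Works for integer columns and for columns of digit strings; nulls are not handled
    code = out['Shortcut_Dimension_4_Code'].astype('int64')
    stage = out['Stage_Code'].astype('int64')
    # Leading one or two digits of the code give the two-digit fiscal year
    fiscal_year = 2000 + code // 1_000_000
    # Put year and month on one axis of months, shift back, then split again
    months = fiscal_year * 12 + (stage - 1) - offset
    parts = pd.DataFrame(
        {'year': months // 12, 'month': months % 12 + 1, 'day': 1},
        index=out.index)
    # pandas assembles datetimes from columns named year, month and day
    out['calendar_date'] = pd.to_datetime(parts)
    out['Date'] = out['calendar_date'].dt.strftime('%m/%Y')
    return out


sample_df = pd.DataFrame(
    [[10225003, 2], [8225003, 1], [8225004, 3], [8225005, 4]],
    columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
result_df = add_calendar_date(sample_df)
print(result_df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']])
```

Keeping `calendar_date` as a datetime column makes filtering by month and year straightforward, for example `result_df[result_df['calendar_date'].dt.year == 2007]`.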