I am trying to create a calendar that summarizes the information in a project directory, organized chronologically and by project type. I have been using Pandas and cannot get the basic structure right. For example, given this dataset:
    Type         Name      Health   Month  Year
0   Marketing    ProjectA  OK       Jan    2018
1   Science      ProjectB  Warning  Apr    2018
2   Marketing    ProjectC  OK       Mar    2018
3   Development  ProjectD  OK       Feb    2018
4   Marketing    ProjectE  OK       Jan    2018
5   Development  ProjectF  Warning  Feb    2018
6   Development  ProjectG  Trouble  May    2018
7   Marketing    ProjectH  Trouble  May    2018
8   Development  ProjectI  Warning  Feb    2018
9   Marketing    ProjectJ  OK       May    2018
10  Science      ProjectK  Warning  Apr    2018
Using the technique shown in "Remove none values from dataframe", I can create fields that track the rank order of each project in the final table:
df['aggval'] = df['Year'].map(str) + df['Month'] + df['Type']
df['index'] = df.groupby(['aggval']).cumcount()
This produces 2 extra columns:
    Type         Name      Health   Month  Year  aggval              index
0   Marketing    ProjectA  OK       Jan    2018  2018JanMarketing    0
1   Science      ProjectB  Warning  Apr    2018  2018AprScience      0
2   Marketing    ProjectC  OK       Mar    2018  2018MarMarketing    0
3   Development  ProjectD  OK       Feb    2018  2018FebDevelopment  0
4   Marketing    ProjectE  OK       Jan    2018  2018JanMarketing    1
5   Development  ProjectF  Warning  Feb    2018  2018FebDevelopment  1
6   Development  ProjectG  Trouble  May    2018  2018MayDevelopment  0
7   Marketing    ProjectH  Trouble  May    2018  2018MayMarketing    0
8   Development  ProjectI  Warning  Feb    2018  2018FebDevelopment  2
9   Marketing    ProjectJ  OK       May    2018  2018MayMarketing    1
10  Science      ProjectK  Warning  Apr    2018  2018AprScience      1
With these derived columns, we can now pivot to create an initial version of the project summary table:
pv1 = pd.pivot_table(df, values='Name', index=['Type', 'index'], columns=['Year', 'Month'], aggfunc=lambda x: "".join(x)).fillna('')
pv1 = pv1.reindex(columns = zip(12 * [2018], ['Jan', 'Feb', 'Mar', 'Apr', 'May']))
This produces the following report. It is basically correct: it collects and lists the projects, shows their names, and organizes them by type (swim lane) and chronologically by year and month:
Year               2018
Month              Jan       Feb       Mar       Apr       May
Type        index
Development 0                ProjectD                      ProjectG
            1                ProjectF
            2                ProjectI
Marketing   0      ProjectA            ProjectC            ProjectH
            1      ProjectE                                ProjectJ
Science     0                                    ProjectB
            1                                    ProjectK
I am now struggling to extend this model to show each project's name and health together.
I can add a second pivot table value for the Health field:
pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
# pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
Producing:
                   Health                                       Name
Year               2018                                         2018
Month              Apr      Feb      Jan      Mar      May      Apr       Feb       Jan       Mar       May
Type        index
Development 0               OK                         Trouble            ProjectD                      ProjectG
            1               Warning                                       ProjectF
            2               Warning                                       ProjectI
Marketing   0                        OK       OK       Trouble                      ProjectA  ProjectC  ProjectH
            1                        OK                OK                           ProjectE            ProjectJ
Science     0      Warning                                      ProjectB
            1      Warning                                      ProjectK
This is the right idea: each project's health and name both appear in the correct month and in the correct type swim lane. But I want them side by side, project by project. Reindexing the columns gives the right result at the header level, but wipes out the cells with NaN values:
pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
Again, the structure is now correct, but the cell values no longer show the project-specific data. What am I missing?
Answer 0 (score: 1)
IIUC, you just need swaplevel and sort_index:
#pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'], columns=['Year', 'Month'], aggfunc={'Name':lambda x: "|".join(x), 'Health':lambda x: ":".join(x), }).fillna('')
pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1)
Out[220]:
Year               2018                                                     \
Month              Apr                Feb                Jan
                   Health   Name      Health   Name      Health   Name
Type        index
Development 0                         OK       ProjectD
            1                         Warning  ProjectF
            2                         Warning  ProjectI
Marketing   0                                            OK       ProjectA
            1                                            OK       ProjectE
Science     0      Warning  ProjectB
            1      Warning  ProjectK

Year
Month              Mar                May
                   Health   Name      Health   Name
Type        index
Development 0                         Trouble  ProjectG
            1
            2
Marketing   0      OK       ProjectC  Trouble  ProjectH
            1                         OK       ProjectJ
Science     0
            1
#pv2.swaplevel(0,1,axis=1).swaplevel(1,2,axis=1).sort_index(axis=1).to_excel('aaaaaa.xlsx')
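As a small check of what the two swaplevel calls do, here is a sketch on a hypothetical two-row frame (toy data, not the question's) whose columns have the same three levels as pv2. It suggests that the chained swaplevel calls amount to a single reorder_levels call, and that sort_index orders the month labels alphabetically rather than by calendar, which matches the Apr, Feb, Jan, Mar, May order in the output above.

```python
import pandas as pd

# Toy stand-in for pv2: the column levels are (field, Year, Month).
cols = pd.MultiIndex.from_product(
    [["Health", "Name"], [2018], ["Jan", "Feb", "Mar"]],
    names=[None, "Year", "Month"],
)
toy = pd.DataFrame([list("abcdef"), list("ghijkl")], columns=cols)

# Two swaps move the field level from the front to the back.
swapped = toy.swaplevel(0, 1, axis=1).swaplevel(1, 2, axis=1)

# The same rearrangement in one call (levels given by position).
reordered = toy.reorder_levels([1, 2, 0], axis=1)

# Sorting the columns is lexicographic, so month names sort as text.
result = swapped.sort_index(axis=1)
months_in_order = list(result.columns.get_level_values("Month").unique())
print(months_in_order)
```

If calendar order matters, the columns still need an explicit reordering step after this, since sorting alone only knows the labels as strings.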
Answer 1 (score: 1)
pv2 starts out with its columns in this order:
In [35]: pv2.columns.tolist()
Out[35]:
[('Health', 2018, 'Apr'),
('Health', 2018, 'Feb'),
('Health', 2018, 'Jan'),
('Health', 2018, 'Mar'),
('Health', 2018, 'May'),
('Name', 2018, 'Apr'),
('Name', 2018, 'Feb'),
('Name', 2018, 'Jan'),
('Name', 2018, 'Mar'),
('Name', 2018, 'May')]
We want to rearrange the columns into this order:
In [36]: list(zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
Out[36]:
[(2018, 'Jan', 'Health'),
(2018, 'Jan', 'Name'),
(2018, 'Feb', 'Health'),
(2018, 'Feb', 'Name'),
(2018, 'Mar', 'Health'),
(2018, 'Mar', 'Name'),
(2018, 'Apr', 'Health'),
(2018, 'Apr', 'Name'),
(2018, 'May', 'Health'),
(2018, 'May', 'Name')]
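Each column is represented by a 3-tuple. reindex can reorder the list of columns, but it cannot change the order of the items inside each 3-tuple. For that, use reorder_levels: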
In [37]: pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
In [38]: pv2.columns.tolist()
Out[38]:
[(2018, 'Apr', 'Health'),
(2018, 'Feb', 'Health'),
(2018, 'Jan', 'Health'),
(2018, 'Mar', 'Health'),
(2018, 'May', 'Health'),
(2018, 'Apr', 'Name'),
(2018, 'Feb', 'Name'),
(2018, 'Jan', 'Name'),
(2018, 'Mar', 'Name'),
(2018, 'May', 'Name')]
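To see why the question's reindex produced empty cells, here is a minimal sketch on a hypothetical one-row frame (toy data, with illustrative variable names). reindex looks up each requested tuple as a complete label, so a tuple whose items are in a different order matches nothing and the new column is filled with NaN.

```python
import pandas as pd

# Toy frame whose column tuples are ordered (field, Year, Month), like pv2.
cols = pd.MultiIndex.from_tuples(
    [("Health", 2018, "Jan"), ("Name", 2018, "Jan")],
    names=[None, "Year", "Month"],
)
toy = pd.DataFrame([["OK", "ProjectA"]], columns=cols)

wanted = [(2018, "Jan", "Health"), (2018, "Jan", "Name")]

# The requested tuples do not exist as labels yet, so every cell becomes NaN.
mismatched = toy.reindex(columns=wanted)

# After reordering the levels (by position), the same tuples match existing labels.
matched = toy.reorder_levels([1, 2, 0], axis=1).reindex(columns=wanted)

print(mismatched.isna().all().all())
print(matched.iloc[0].tolist())
```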
Once the levels are in the desired order, you can call reindex to reorder the columns (so that the months appear in calendar order).
import sys
import pandas as pd
pd.options.display.width = sys.maxsize
df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})
# Rank each project within its (Year, Month, Type) group.
df['index'] = df.groupby(['Year','Month','Type']).cumcount()
pv2 = pd.pivot_table(df, values=['Name', 'Health'], index=['Type', 'index'],
                     columns=['Year', 'Month'],
                     aggfunc={'Name':lambda x: "|".join(x),
                              'Health':lambda x: ":".join(x), }).fillna('')
# Put the column levels in (Year, Month, field) order first ...
pv2 = pv2.reorder_levels(['Year','Month',0], axis=1)
# ... then reorder the columns so that the months are in calendar order.
pv2 = pv2.reindex(columns = zip(10 * [2018], ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar', 'Apr', 'Apr', 'May', 'May'], ['Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name', 'Health', 'Name']))
print(pv2)
yields
Year               2018
Month              Jan                Feb                Mar                Apr                May
                   Health   Name      Health   Name      Health   Name      Health   Name      Health   Name
Type        index
Development 0                         OK       ProjectD                                        Trouble  ProjectG
            1                         Warning  ProjectF
            2                         Warning  ProjectI
Marketing   0      OK       ProjectA                     OK       ProjectC                     Trouble  ProjectH
            1      OK       ProjectE                                                           OK       ProjectJ
Science     0                                                               Warning  ProjectB
            1                                                               Warning  ProjectK
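Writing the target tuples out by hand is easy to get wrong: the zip call above only yields ten tuples because zip stops at its shortest argument. One possible alternative, sketched here on the same data, is to build the target columns with pd.MultiIndex.from_product. The names months and target are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Marketing", "Science", "Marketing", "Development", "Marketing",
             "Development", "Development", "Marketing", "Development",
             "Marketing", "Science"],
    "Name": ["ProjectA", "ProjectB", "ProjectC", "ProjectD", "ProjectE",
             "ProjectF", "ProjectG", "ProjectH", "ProjectI", "ProjectJ",
             "ProjectK"],
    "Health": ["OK", "Warning", "OK", "OK", "OK", "Warning", "Trouble",
               "Trouble", "Warning", "OK", "Warning"],
    "Month": ["Jan", "Apr", "Mar", "Feb", "Jan", "Feb", "May", "May", "Feb",
              "May", "Apr"],
    "Year": [2018] * 11,
})
df["index"] = df.groupby(["Year", "Month", "Type"]).cumcount()

pv2 = pd.pivot_table(
    df, values=["Name", "Health"], index=["Type", "index"],
    columns=["Year", "Month"],
    aggfunc={"Name": lambda x: "|".join(x), "Health": lambda x: ":".join(x)},
).fillna("")

# Every (year, month, field) combination, with months in calendar order.
months = ["Jan", "Feb", "Mar", "Apr", "May"]
target = pd.MultiIndex.from_product([[2018], months, ["Health", "Name"]])

# Move the field level last (levels given by position), then reorder columns.
pv2 = pv2.reorder_levels([1, 2, 0], axis=1).reindex(columns=target)
print(pv2)
```

from_product generates the combinations itself, so adding a month or a field means editing one short list instead of three parallel ones.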
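Although there are times when you need to specify the desired column order by hand, this is not (necessarily) one of them. The order you want is natural date order, so it is to our advantage to parse the Year and Month labels as actual dates (dtype datetime64[ns]). This unlocks Pandas' intelligent datetime handling.
For example, if we pivot on a date column (that is, a column of dtype datetime64[ns]), pivot_table will automatically sort the dates for us.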
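As a small illustration of the difference, the sketch below sorts three hypothetical Year/Month labels once as text and once as parsed dates. It assumes English month abbreviations, because the %b directive depends on the locale.

```python
import pandas as pd

# Hypothetical labels, deliberately out of calendar order.
labels = pd.DataFrame({"Year": [2018, 2018, 2018], "Month": ["Mar", "Jan", "Feb"]})
combined = labels["Year"].astype(str) + labels["Month"]

# Sorted as plain strings: alphabetical by month name.
as_text = combined.sort_values().tolist()

# Parsed to datetime64[ns] first: sorted chronologically.
as_dates = pd.to_datetime(combined, format="%Y%b")
chronological = as_dates.sort_values().dt.strftime("%b").tolist()

print(as_text)
print(chronological)
```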
Moreover, we can conveniently generate all the calendar months in order, without having to type the dates by hand:
dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
We can then easily convert the DatetimeIndex into a 2-level year/month MultiIndex (for presentation purposes):
pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
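The two steps can be tried in isolation, without the pivot table. This sketch applies the same strftime and split calls directly to the generated dates; the variable name labels is illustrative.

```python
import pandas as pd

# One timestamp per month start ('MS') for the whole of 2018.
dates = pd.date_range("2018-01-01", "2018-12-31", freq="MS")

# Format each date as 'YYYY-Mon', then split it into two index levels.
labels = pd.Index(dates.strftime("%Y-%b")).str.split("-", expand=True)

print(len(dates))
print(labels[:3].tolist())
```

Note that both levels of the resulting MultiIndex hold strings, so the year is '2018' rather than the integer 2018 used in the first solution.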
For example,
import sys
import pandas as pd
pd.options.display.width = sys.maxsize
df = pd.DataFrame({'Health': ['OK', 'Warning', 'OK', 'OK', 'OK', 'Warning', 'Trouble', 'Trouble', 'Warning', 'OK', 'Warning'], 'Month': ['Jan', 'Apr', 'Mar', 'Feb', 'Jan', 'Feb', 'May', 'May', 'Feb', 'May', 'Apr'], 'Name': ['ProjectA', 'ProjectB', 'ProjectC', 'ProjectD', 'ProjectE', 'ProjectF', 'ProjectG', 'ProjectH', 'ProjectI', 'ProjectJ', 'ProjectK'], 'Type': ['Marketing', 'Science', 'Marketing', 'Development', 'Marketing', 'Development', 'Development', 'Marketing', 'Development', 'Marketing', 'Science'], 'Year': [2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018]})
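# Parse Year and Month into real dates (dtype datetime64[ns]).
df['Date'] = pd.to_datetime(df['Year'].astype('str')+df['Month'], format='%Y%b')
# Rank each project within its (Date, Type) group.
df['index'] = df.groupby(['Date','Type']).cumcount()
# Pivot with the dates on the index so that they come out sorted.
pv2 = pd.pivot_table(df, values=['Name', 'Health'], columns=['Type', 'index'],
                     index=['Date'],
                     aggfunc={'Name':lambda x: "|".join(x),
                              'Health':lambda x: ":".join(x), }).fillna('')
# Add a row for every calendar month of 2018, including months with no projects.
dates = pd.date_range('2018-01-01', '2018-12-31', freq='MS')
pv2 = pv2.reindex(dates, fill_value='')
# Convert the DatetimeIndex into a 2-level (year, month) MultiIndex.
pv2.index = pd.Index(pv2.index.strftime('%Y-%b')).str.split('-', expand=True)
# Move the Health/Name level from the columns to the index, then transpose
# so that Type/index label the rows and year/month/field label the columns.
pv2 = pv2.stack(0)
pv2 = pv2.T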
print(pv2)
yields
                   2018                                                                                           ...
                   Jan                Feb                Mar                Apr                May                ...  Aug                Sep                Oct                Nov                Dec
                   Health   Name      Health   Name      Health   Name      Health   Name      Health   Name      ...  Health   Name      Health   Name      Health   Name      Health   Name      Health   Name
Type        index                                                                                                 ...
Development 0                         OK       ProjectD                                        Trouble  ProjectG  ...
            1                         Warning  ProjectF                                                           ...
            2                         Warning  ProjectI                                                           ...
Marketing   0      OK       ProjectA                     OK       ProjectC                     Trouble  ProjectH  ...
            1      OK       ProjectE                                                           OK       ProjectJ  ...
Science     0                                                               Warning  ProjectB                     ...
            1                                                               Warning  ProjectK                     ...
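The "..." in the printed table is only pandas abbreviating a wide frame for display, most likely because of the display.max_columns option; the hidden columns are still present in pv2. As a sketch on a hypothetical wide frame (toy data, not the answer's), DataFrame.to_string() renders every column regardless of the display options:

```python
import pandas as pd

# Hypothetical one-row frame with 40 columns, wide enough to be abbreviated
# by the default display settings.
wide = pd.DataFrame(
    [[f"v{i}" for i in range(40)]],
    columns=[f"col{i}" for i in range(40)],
)

# to_string() always returns the full table as text.
full_text = wide.to_string()
print(full_text)
```

Applied to the answer's result, print(pv2.to_string()) would show all twelve months, including June and July.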