Wicked pandas pivot with MultiIndex and regex

Time: 2018-08-28 17:28:30

Tags: python regex pandas multi-index

I have tried every combination of pandas.melt, .stack and .pivot, but have not made any progress.

I have an Excel sheet in a regular format, reproduced in the table further down.

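I am trying to wrangle this data so that I can create the following bar chart:

  • x-axis: the years 1997 to 2000
  • left y-axis: the count N
  • right y-axis: the %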
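
A chart with these three axes is commonly drawn with two y-axes that share one x-axis. Below is a minimal sketch, assuming matplotlib is available and using the 'F' row of the sheet as sample values. The variable names are made up for illustration, and drawing the percentages as a line instead of a second set of bars is only one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Sample values: the 'F' row of the sheet, already split into N and %.
years = ["1997", "1998", "1999", "2000"]
counts = [1650, 1493, 1673, 1628]
percents = [33, 33, 33, 32]

fig, ax_count = plt.subplots()
ax_pct = ax_count.twinx()  # second y-axis that shares the same x-axis

# Counts as bars on the left axis, percentages as a line on the right axis.
ax_count.bar(years, counts, color="tab:blue")
ax_pct.plot(years, percents, color="tab:red", marker="o")

ax_count.set_xlabel("Year")
ax_count.set_ylabel("Count N")
ax_pct.set_ylabel("%")
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure.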
The Excel sheet I used to build the df:

                      1997        1998        1999        2000
Total, N (%)          3350 (34)   3387 (33)   4778 (33)   3588 (33)
Age category N, (%)
  A                   231 (24)    227 (24)    222 (23)    211 (22)
  B                   492 (24)    481 (24)    487 (24)    405 (24)
  C                   759 (28)    759 (27)    746 (26)    746 (26)
  D                   1901 (45)   1873 (44)   1233 (44)   1903 (44)
Sex, N (%)
  F                   1650 (33)   1493 (33)   1673 (33)   1628 (32)
  M                   1734 (35)   1794 (34)   1705 (34)   1760 (34)
Diet
  Vegan               1553 (32)   1442 (31)   1453 (31)   1422 (31)
  Carnivore           1857 (36)   1063 (36)   1225 (35)   1926 (34)
Favorite movie
  horror              1036 (24)   1033 (24)   1458 (24)   1742 (24)
  romance             732 (41)    743 (40)    735 (40)    799 (38)
  comedy              514 (34)    498 (32)    518 (32)    496 (32)
  silent              1110 (47)   1933 (47)   1967 (46)   1751 (46)
* Percents are in relation to 100% of children who filled out survey

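I have two problems. I want to pivot the table so that the 'Age', 'Sex', 'Diet' and 'Favorite movie' rows become MultiIndex columns, with each category's subcategories beneath it, and the years become the rows (observations). The code I use to build the df: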

import pandas as pd

# One (category, subcategory) label per data row. The category header rows are
# empty in the sheet (NaN), so they are left out to keep the data and the index
# the same length; the total row has no subcategory.
index = pd.MultiIndex.from_tuples(
    [('Total, N (%)', ''),
     ('Age category N, (%)', 'A'), ('Age category N, (%)', 'B'),
     ('Age category N, (%)', 'C'), ('Age category N, (%)', 'D'),
     ('Sex, N (%)', 'F'), ('Sex, N (%)', 'M'),
     ('Diet', 'Vegan'), ('Diet', 'Carnivore'),
     ('Favorite movie', 'horror'), ('Favorite movie', 'romance'),
     ('Favorite movie', 'comedy'), ('Favorite movie', 'silent')],
    names=['category', 'subcategory'])

# Each cell is kept as the raw 'N (%)' string from the sheet.
df = pd.DataFrame(
    {'1997': ['33850 (34)', '231 (24)', '492 (24)', '759 (28)', '1901 (45)',
              '1650 (33)', '1734 (35)', '1553 (32)', '1857 (36)',
              '1036 (24)', '732 (41)', '514 (34)', '1110 (47)'],
     '1998': ['33687 (33)', '227 (24)', '481 (24)', '759 (27)', '1873 (44)',
              '1493 (33)', '1794 (34)', '1442 (31)', '1063 (36)',
              '1033 (24)', '743 (40)', '498 (32)', '1933 (47)'],
     '1999': ['3778 (33)', '222 (23)', '487 (24)', '746 (26)', '1233 (44)',
              '1673 (33)', '1705 (34)', '1453 (31)', '1225 (35)',
              '1458 (24)', '735 (40)', '518 (32)', '1967 (46)'],
     '2000': ['3588 (33)', '211 (22)', '405 (24)', '746 (26)', '1903 (44)',
              '1628 (32)', '1760 (34)', '1422 (31)', '1926 (34)',
              '1742 (24)', '799 (38)', '496 (32)', '1751 (46)']},
    index=index)
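
Once the rows carry a (category, subcategory) MultiIndex, a plain transpose may already give the layout described above, with years as rows and the categories as MultiIndex columns. A minimal sketch on a hypothetical two-category, two-year miniature of the frame; the names small and wide and the level names are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical miniature of the frame: two categories, two years.
index = pd.MultiIndex.from_tuples(
    [("Sex, N (%)", "F"), ("Sex, N (%)", "M"),
     ("Diet", "Vegan"), ("Diet", "Carnivore")],
    names=["category", "subcategory"],
)
small = pd.DataFrame(
    {"1997": ["1650 (33)", "1734 (35)", "1553 (32)", "1857 (36)"],
     "1998": ["1493 (33)", "1794 (34)", "1442 (31)", "1063 (36)"]},
    index=index,
)

# Transposing swaps the axes: the years become the row index and the
# (category, subcategory) pairs become MultiIndex columns.
wide = small.T
wide.index.name = "year"

print(wide["Sex, N (%)"])                    # every subcategory of one category
print(wide.loc["1998", ("Diet", "Vegan")])   # a single cell
```

Selecting with a top-level label such as wide["Diet"] returns all subcategories of that category at once, which is the main convenience of keeping the two levels.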

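The crux of the problem is that I have to separate the '(%)' from the count while keeping it "associated" with its number (so that each value goes to the correct y-axis).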
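
One possible way to do the split is to pull the two numbers out with Series.str.extract and keep them together under an extra outer column level. The sketch below assumes every cell looks like "1650 (33)". The helper extract_number, the level labels "N" and "pct", and the miniature frame raw are all hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature of the sheet: raw "N (%)" strings, years as columns.
index = pd.MultiIndex.from_tuples(
    [("Sex, N (%)", "F"), ("Sex, N (%)", "M")],
    names=["category", "subcategory"],
)
raw = pd.DataFrame(
    {"1997": ["1650 (33)", "1734 (35)"],
     "1998": ["1493 (33)", "1794 (34)"]},
    index=index,
)

def extract_number(frame, pattern):
    """Apply a one-group regex to every column and convert the match to a number."""
    return frame.apply(
        lambda col: pd.to_numeric(col.str.extract(pattern, expand=False))
    )

counts = extract_number(raw, r"^\s*(\d+)\s*\(")   # digits before the "("
percents = extract_number(raw, r"\((\d+)\)")      # digits inside the "( )"

# Years as rows; the outer column level records which measure a value is.
tidy = pd.concat({"N": counts.T, "pct": percents.T}, axis=1)

print(tidy["N"])    # counts only, e.g. for the left y-axis
print(tidy["pct"])  # percents only, e.g. for the right y-axis
```

Because both measures share the same (category, subcategory) columns and year rows, tidy["N"] and tidy["pct"] stay aligned cell for cell.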

Any guidance is appreciated!

0 answers:

No answers