Python crosstab with multiple variables or rows; demographic table

Time: 2019-03-14 17:03:36

Tags: python pandas

Question

I have a question similar to this one: Crosstab with multiple items. However, instead of doing it in R, I am trying to do it with crosstab in Python pandas.

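I have been trying to build a demographic table with the pandas crosstab function, but I can only tabulate one demographic variable at a time. In other words, I want a single crosstab in which all the row variables sit at the same level. Maybe this is not what crosstab is for, and something like a pandas pivot table would work better?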

Currently I use the following three lines of code, but I would think there is some way to combine them:

genderTable = pd.crosstab(refQtrData['GENDER'], [refQtrData['FUNDINGSOURCE'],refQtrData['PROVIDER'],refQtrData['LOCATION']], margins=True)
raceTable = pd.crosstab(refQtrData['RACETH4'], [refQtrData['FUNDINGSOURCE'],refQtrData['PROVIDER'],refQtrData['LOCATION']], margins=True)
ageTable = pd.crosstab(refQtrData['REFERRED'], [refQtrData['FUNDINGSOURCE'],refQtrData['PROVIDER'],refQtrData['LOCATION']], values=refQtrData['AGEREF'], aggfunc='mean')

What I want to produce: Demographic Table

Other miscellaneous information

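This was originally done in SPSS with the code below, but I am trying to move it to Python. Just as SPSS CTABLES lets me have multiple categories and variables, I want multiple rows that correspond to different variables without having to sit at different levels.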

CTABLES
  /VLABELS VARIABLES= GENDER RACE AGE FUNDINGSOURCE PROVIDER LOCATION 
    DISPLAY=LABEL
  /TABLE REFERRED [C][COUNT F40.0] + GENDER [C][COUNT F40.0, COLPCT.COUNT PCTPAREN40.0] + RACE 
    [C][COUNT F40.0, COLPCT.COUNT PCTPAREN40.0] + AGE [S][MEAN] + AGE [S][MINIMUM, MAXIMUM]
    BY FUNDINGSOURCE [C] > PROVIDER [C] > LOCATION [C]
  /SLABELS VISIBLE=NO
  /CATEGORIES VARIABLES=GENDER RACE ORDER=A KEY=VALUE MISSING=INCLUDE EMPTY=INCLUDE
  /CATEGORIES VARIABLES=FUNDINGSOURCE ORDER=A KEY=VALUE MISSING=INCLUDE EMPTY=EXCLUDE
  /CATEGORIES VARIABLES=PROVIDER [1, 2] EMPTY=EXCLUDE 
  /CATEGORIES VARIABLES=LOCATION [1, 2] EMPTY=EXCLUDE.

1 Answer:

Answer 0: (score: 0)

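In the absence of a reproducible example, we can fall back on the pandas crosstab documentation, which provides sample crosstab data that can be copied and pasted, as below.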

import pandas as pd
import numpy as np

a = np.array(["foo", "foo", "foo", "foo", "bar", "bar","bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"],dtype=object)
d = np.array(["1foo", "1foo", "1foo", "1foo", "1bar", "1bar","1bar", "1bar", "1foo", "1foo", "1foo"], dtype=object)

This gives four arrays. Build the crosstabs; each call returns a DataFrame.

df1 =  pd.crosstab(a, [b, c], rownames=['aa'], colnames=['b', 'c'])
df2 =  pd.crosstab(d, [b, c], rownames=['aa'], colnames=['b', 'c'])

Use pandas.concat([], axis=...) to concatenate the DataFrames:

>>> pd.concat([df1, df2], axis=0)
b     one        two      
c    dull shiny dull shiny
aa                        
bar     1     2    1     0
foo     2     2    1     2
1bar    1     2    1     0
1foo    2     2    1     2

>>> pd.concat([df1, df2], axis=1)
b     one        two        one        two      
c    dull shiny dull shiny dull shiny dull shiny
1bar  NaN   NaN  NaN   NaN  1.0   2.0  1.0   0.0
1foo  NaN   NaN  NaN   NaN  2.0   2.0  1.0   2.0
bar   1.0   2.0  1.0   0.0  NaN   NaN  NaN   NaN
foo   2.0   2.0  1.0   2.0  NaN   NaN  NaN   NaN
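
With axis=0 the stacked rows no longer show which variable each block came from. One possible way to keep that information, sketched here as an assumption rather than part of the original answer, is to pass keys= to pd.concat so that each block is tagged in an outer index level. The labels "a" and "d" and the level name "variable" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample arrays as in the answer above.
a = np.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
d = np.array(["1foo", "1foo", "1foo", "1foo", "1bar", "1bar", "1bar", "1bar", "1foo", "1foo", "1foo"], dtype=object)

df1 = pd.crosstab(a, [b, c], rownames=["aa"], colnames=["b", "c"])
df2 = pd.crosstab(d, [b, c], rownames=["aa"], colnames=["b", "c"])

# keys= adds an outer row level naming the source of each block;
# "a", "d" and "variable" are illustrative labels, not from the question.
stacked = pd.concat([df1, df2], axis=0, keys=["a", "d"], names=["variable", "aa"])
print(stacked)

# Pull out the block that came from a single variable.
print(stacked.xs("d", level="variable"))
```

The result has a two-level row index, so rows from different variables sit side by side in one table while remaining distinguishable.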

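As for creating the three crosstabs with a single function call, implement a function that takes the data and returns the concatenated crosstabs. I am not sure it can be done as a reasonable one-liner.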

That leaves you with a single DataFrame to modify further or to join with others.
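
A minimal sketch of such a function follows. The name demographic_table, its parameters, and the sample data are hypothetical. Only the column names GENDER, RACETH4, FUNDINGSOURCE and PROVIDER come from the question. It builds one count crosstab per row variable against a shared set of column variables, then stacks them with pd.concat.

```python
import pandas as pd


def demographic_table(data, row_vars, col_vars):
    """Stack one count crosstab per row variable (hypothetical helper)."""
    columns = [data[col] for col in col_vars]
    # One crosstab per row variable, all sharing the same column variables.
    tables = [pd.crosstab(data[row], columns, margins=True) for row in row_vars]
    # keys= tags each block with the variable it came from.
    return pd.concat(tables, axis=0, keys=row_vars, names=["variable", "category"])


# Made-up data for illustration; the real refQtrData is not shown in the question.
data = pd.DataFrame({
    "GENDER": ["F", "M", "F", "M", "F", "M"],
    "RACETH4": ["A", "B", "B", "A", "A", "B"],
    "FUNDINGSOURCE": ["state", "state", "federal", "federal", "state", "federal"],
    "PROVIDER": ["P1", "P1", "P2", "P2", "P1", "P2"],
})

table = demographic_table(data, ["GENDER", "RACETH4"], ["FUNDINGSOURCE", "PROVIDER"])
print(table)
```

A table of means, such as the age crosstab in the question, could presumably be added to the list before concatenation in the same way, as long as it uses the same column variables.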