Computing the mean of unique combinations within a pandas groupby

Time: 2015-08-05 14:40:47

Tags: python pandas

I have the following pandas DataFrame:

from pandas import DataFrame
data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2,1,2,1,2,1,2,1]})

which looks like:

     A      B  C
0  foo    one  2
1  bar    one  1
2  foo    two  2
3  bar  three  1
4  foo    two  2
5  bar    two  1
6  foo    one  2
7  foo  three  1

What I need is the mean of C over each unique combination of A and B. For example, for foo:

  A     B C
foo   one 2
foo   two 2
foo three 1

mean = 1.66666667

and the output should be this 'mean' computed for each value of A, i.e.:

foo 1.666667
bar 1

I tried:

data.groupby(['A'], sort=False, as_index=False).mean()

but it returns:

foo 1.8
bar 1

Is there a way to compute the mean of only the unique combinations? If so, how?

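3 Answers:

Answer 0: (score: 1)

Yes. Here is a solution that does what you want. First, group by columns A and B so that you get one row per unique combination. Then compute mean() of those rows for each value of column A.

You can do it like this: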

import pandas as pd
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' :[2.0,1,2,1,2,1,2,1]})
# Step 1: mean of C for each unique (A, B) combination
per_pair = data.groupby(['A','B'], sort=False, as_index=False)['C'].mean()
# Step 2: mean of those per-combination values for each A
print(per_pair.groupby('A', sort=False, as_index=False)['C'].mean())

Output:

     A         C
0  foo  1.666667
1  bar  1.000000

When you call data.groupby(['A'], sort=False, as_index=False).mean(), you average every value of column C within each group of column A, repeated (A, B) rows included: foo 1.8 (9/5), bar 1.0 (3/3). That is why it returns those values.
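
The difference between the two results can be checked directly. This is a minimal sketch on the question's data; the names `naive` and `unique_mean` are illustrative and do not appear in the original post.

```python
import pandas as pd

data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [2, 1, 2, 1, 2, 1, 2, 1],
})

# Plain per-A mean: every row counts, so repeated (A, B) pairs weigh more.
naive = data.groupby('A')['C'].mean()

# Mean per unique (A, B) pair first, then the mean of those values per A.
unique_mean = data.groupby(['A', 'B'])['C'].mean().groupby(level='A').mean()

print(naive)        # foo is 9/5 = 1.8
print(unique_mean)  # foo is (2 + 2 + 1)/3, about 1.666667
```

For foo, the five rows sum to 9, giving 1.8, while the three unique combinations sum to 5, giving roughly 1.67. For bar, all combinations are already unique, so both approaches agree.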

I think that should answer your question :)

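Answer 1: (score: 1)

This is essentially the same as @S_A's answer, but a bit more concise.

You can compute the mean for each combination of A and B with: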

In [41]: df.groupby(['A', 'B']).mean()
Out[41]: 
           C
A   B       
bar one    1
    three  1
    two    1
foo one    2
    three  1
    two    2

Then compute the mean of those values for each A with:

In [42]: df.groupby(['A', 'B']).mean().groupby(level='A').mean()
Out[42]: 
            C
A            
bar  1.000000
foo  1.666667

Answer 2: (score: 0)

This worked for me:

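# Drop rows that are identical in A, B and C
test = data.drop_duplicates()
# Average C for each A; selecting C keeps the string column B out of mean()
test = test.groupby('A')[['C']].mean()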

Output:

            C
A            
bar  1.000000
foo  1.666667
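
Note that drop_duplicates() only removes rows that are identical in every column, so this matches the two-step groupby only when repeated (A, B) pairs carry the same C, as they do in the question's data. The sketch below uses hypothetical data (not from the question) where the two approaches give different results:

```python
import pandas as pd

# Hypothetical data: (foo, one) appears twice with different C values.
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo'],
    'B': ['one', 'one', 'two'],
    'C': [1, 3, 5],
})

# No row is fully identical, so drop_duplicates keeps all three rows.
dedup_mean = df.drop_duplicates().groupby('A')['C'].mean()

# Two-step groupby: (foo, one) -> 2.0, (foo, two) -> 5.0, then their mean.
two_step_mean = df.groupby(['A', 'B'])['C'].mean().groupby(level='A').mean()

print(dedup_mean['foo'])     # 3.0
print(two_step_mean['foo'])  # 3.5
```

If C can vary within an (A, B) pair, the two-step groupby from the other answers is probably the safer choice.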