Pandas running subtotal with a filter - apply and lambda?

Time: 2017-04-28 19:23:06

Tags: python pandas split-apply-combine

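I am trying to build something that, for each record in a pandas DataFrame, shows the total of a given column over the records that came before it, and also the total of that column over the earlier records that match the current record on some other field.

Note that the current record's STARTDATE should be compared against the ENDDATE of every record (only profit from periods that ended before the current period began should count).

I need to clarify this because Diego Amicabile proposed a very nice answer below which, unfortunately, does not get me where I need to be (I originally posted this question with only a REPORTDATE field).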


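So in this DataFrame I would like to end up with two extra columns: total profit (sumall) and company profit (sumco).

sumall would be 0 for the first record, -500 for the second (the first record ends on 2017-01-01, before the second one starts), 300 for the third (-500 + 800), and so on.

sumco would be 0 until we reach the second IBM record, where it would be -500. It stays at -500 for the third IBM record, because the second IBM record ends (2017-03-03) after the third one starts.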

It should look like this: [image: expected output]

The code is below. I am doing something wrong, but I cannot figure out what it is.

import io
import pandas as pd

text = """CO         SECTOR    PROFIT   STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM         TECHNOLOGY  -500    2500        500         2017-01-01 2017-01-01
APPLE       TECHNOLOGY   800    4000        300         2017-01-02 2017-01-03
GM          INDUSTRIAL   250    1000          0         2017-02-01 2017-02-03
IBM    INDUSTRIAL   600    3000        100         2017-03-01 2017-03-03
IBM    INDUSTRIAL   600    35000        100         2017-03-02 2017-06-01"""

df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])

df['sumall'] = df.apply(lambda y:  df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
df['sumco'] = df.apply(lambda y:  df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())

The error is as follows:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()

pandas\src\hashtable_class_helper.pxi in 
pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)()

TypeError: an integer is required



C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4150                     if reduce is None:
   4151                         reduce = True
-> 4152                     return self._apply_standard(f, axis, reduce=reduce)
   4153             else:
   4154                 return self._apply_broadcast(f, axis)

C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4246             try:
   4247                 for i, v in enumerate(series_gen):
-> 4248                     results[i] = func(v)
   4249                     keys.append(v.name)
   4250             except Exception as e:

    <ipython-input-13-92e1d7684747> in <lambda>(y)
KeyError                                  Traceback (most recent call last)
<ipython-input-13-92e1d7684747> in <module>()
     11 df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])
     12 
---> 13 df['sumall'] = df.apply(lambda y:  df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
     14 df['sumco'] = df.apply(lambda y:  df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())

C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    601         key = com._apply_if_callable(key, self)
    602         try:
--> 603             result = self.index.get_value(self, key)
    604 
    605             if not is_scalar(result):

    C:\Users\User\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_value(self, series, key)
   2167         try:
   2168             return self._engine.get_value(s, k,
-> 2169                                           tz=getattr(series.dtype, 'tz', None))
   2170         except KeyError as e1:
   2171             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3557)()

pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3240)()

pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4363)()

KeyError: ('STARTDATE', 'occurred at index CO')
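
The message `KeyError: ('STARTDATE', 'occurred at index CO')` suggests that the lambda is being handed whole columns rather than rows: `DataFrame.apply` defaults to `axis=0`, so `y` is the `CO` column, which has no `STARTDATE` label. Below is a minimal sketch of one possible fix, assuming the intent described in the question. It passes `axis=1`, uses the real column name `CO` instead of `co`, and parses `STARTDATE` and `ENDDATE` as dates. The `sep` and `parse_dates` arguments are choices made for this sketch, not part of the original code.

```python
import io
import pandas as pd

text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02 2017-01-03
GM INDUSTRIAL 250 1000 0 2017-02-01 2017-02-03
IBM INDUSTRIAL 600 3000 100 2017-03-01 2017-03-03
IBM INDUSTRIAL 600 35000 100 2017-03-02 2017-06-01"""

# Parse the two date columns by name so the comparisons below are on timestamps.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", parse_dates=["STARTDATE", "ENDDATE"])

# axis=1 makes apply call the lambda once per row instead of once per column.
df["sumall"] = df.apply(
    lambda row: df.loc[df["ENDDATE"] < row["STARTDATE"], "PROFIT"].sum(),
    axis=1,
)

# Same filter, additionally restricted to rows of the same company.
df["sumco"] = df.apply(
    lambda row: df.loc[
        (df["ENDDATE"] < row["STARTDATE"]) & (df["CO"] == row["CO"]), "PROFIT"
    ].sum(),
    axis=1,
)

print(df[["CO", "STARTDATE", "ENDDATE", "PROFIT", "sumall", "sumco"]])
```

Each row triggers a full scan of the frame, so this is quadratic in the number of rows; it is meant to be easy to read rather than fast.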

1 answer:

Answer 0 (score: 1)

My take on it. There are surely better ways.

import io
import pandas as pd
import numpy as np

text = """CO         SECTOR    PROFIT   STARTMVYEAR TOTALPROFIT REPORTDATE
IBM         TECHNOLOGY  -500    2500        500         2017-01-01
APPLE       TECHNOLOGY   800    4000        300         2017-01-02
GM          INDUSTRIAL   250    1000          0         2017-02-01
IBM    INDUSTRIAL   600    3000        100         2017-03-01
IBM    INDUSTRIAL   600    35000        100         2017-03-02"""

# Parse REPORTDATE (not column 0, which is CO) so the sort is on real dates.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", parse_dates=["REPORTDATE"]).sort_values(by="REPORTDATE")
df['sumall'] = df.PROFIT.cumsum()-df['PROFIT']
df['sumco']=df.groupby('CO')['PROFIT'].cumsum()
df['sumco']= np.where(df['sumco'] ==df['PROFIT'], 0, df['sumco'] )
print(df[['CO','REPORTDATE' ,'PROFIT', 'sumall','sumco']])

Output

      CO  REPORTDATE  PROFIT  sumall  sumco
0    IBM  2017-01-01    -500       0      0
1  APPLE  2017-01-02     800    -500      0
2     GM  2017-02-01     250     300      0
3    IBM  2017-03-01     600     550    100
4    IBM  2017-03-02     600    1150    700
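
The cumulative-sum approach above relies on a single REPORTDATE and on sorted order, and its sumco values for repeated companies include the current row's own profit. For the edited question, where each record's STARTDATE is compared with every record's ENDDATE, a different technique is needed. The following is a sketch using NumPy broadcasting, not the answerer's method; the names `ended_before` and `same_co` are illustrative, and the data is the sample from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CO": ["IBM", "APPLE", "GM", "IBM", "IBM"],
    "PROFIT": [-500, 800, 250, 600, 600],
    "STARTDATE": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-02-01", "2017-03-01", "2017-03-02"]
    ),
    "ENDDATE": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-02-03", "2017-03-03", "2017-06-01"]
    ),
})

start = df["STARTDATE"].to_numpy()
end = df["ENDDATE"].to_numpy()
co = df["CO"].to_numpy()
profit = df["PROFIT"].to_numpy()

# ended_before[i, j] is True when record j ended before record i started.
ended_before = end[None, :] < start[:, None]
# same_co[i, j] is True when records i and j belong to the same company.
same_co = co[None, :] == co[:, None]

# For each row i, add up the profit of every column j that passes the mask.
df["sumall"] = np.where(ended_before, profit, 0).sum(axis=1)
df["sumco"] = np.where(ended_before & same_co, profit, 0).sum(axis=1)

print(df[["CO", "STARTDATE", "ENDDATE", "PROFIT", "sumall", "sumco"]])
```

This builds an n-by-n boolean matrix, so memory grows quadratically with the number of records; it suits small to medium frames.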