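I'm trying to build something that, for each record in a pandas DataFrame, shows the total of a given column and also the total of that column over certain records that occurred before the current record.
Note that the comparison should be the current record's STARTDATE against the ENDDATE of all records (only profit from periods that ended before the current period started).
I need to clarify this, because Diego Amicabile proposed a very nice answer below that unfortunately doesn't get me where I need to be (I originally posted this question with only a report date field).
So in this DataFrame I want to end up with two columns: total profit (sumall) and company profit (sumco).
sumall is 0 for the first record, -500 for the second record (only the record ending on 2017-01-01 finished before it started), 300 for the third record (-500 + 800), and so on.
sumco is 0 until we reach the second IBM record, where it is -500. It is still -500 at the third IBM record, because the second IBM record ends (2017-03-03) after the third record starts.
It should look like this (sample data and my attempt so far):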
import io
import pandas as pd
text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02 2017-01-03
GM INDUSTRIAL 250 1000 0 2017-02-01 2017-02-03
IBM INDUSTRIAL 600 3000 100 2017-03-01 2017-03-03
IBM INDUSTRIAL 600 35000 100 2017-03-02 2017-06-01"""
df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])
df['sumall'] = df.apply(lambda y: df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
df['sumco'] = df.apply(lambda y: df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())
The error is as follows:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)()
TypeError: an integer is required
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4150 if reduce is None:
4151 reduce = True
-> 4152 return self._apply_standard(f, axis, reduce=reduce)
4153 else:
4154 return self._apply_broadcast(f, axis)
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4246 try:
4247 for i, v in enumerate(series_gen):
-> 4248 results[i] = func(v)
4249 keys.append(v.name)
4250 except Exception as e:
<ipython-input-13-92e1d7684747> in <lambda>(y)
KeyError Traceback (most recent call last)
<ipython-input-13-92e1d7684747> in <module>()
11 df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])
12
---> 13 df['sumall'] = df.apply(lambda y: df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
14 df['sumco'] = df.apply(lambda y: df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
601 key = com._apply_if_callable(key, self)
602 try:
--> 603 result = self.index.get_value(self, key)
604
605 if not is_scalar(result):
C:\Users\User\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_value(self, series, key)
2167 try:
2168 return self._engine.get_value(s, k,
-> 2169 tz=getattr(series.dtype, 'tz', None))
2170 except KeyError as e1:
2171 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3557)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3240)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4363)()
KeyError: ('STARTDATE', 'occurred at index CO')
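The `occurred at index CO` part of the message suggests that `apply` is handing the lambda whole columns rather than rows: `DataFrame.apply` defaults to `axis=0`, so `y` is the `CO` column and has no `'STARTDATE'` label. A minimal sketch of the row-wise version follows, assuming the date columns should be parsed by name and that the company column is referenced as `CO` (the attempt used lowercase `co`); `sep=r"\s+"` stands in for `delim_whitespace=True`, which is deprecated in recent pandas releases.

```python
import io
import pandas as pd

text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02 2017-01-03
GM INDUSTRIAL 250 1000 0 2017-02-01 2017-02-03
IBM INDUSTRIAL 600 3000 100 2017-03-01 2017-03-03
IBM INDUSTRIAL 600 35000 100 2017-03-02 2017-06-01"""

# Parse the two date columns by name so the comparisons are done on timestamps.
df = pd.read_csv(io.StringIO(text), sep=r"\s+",
                 parse_dates=["STARTDATE", "ENDDATE"])

# axis=1 passes each row to the lambda, so y["STARTDATE"] is a single timestamp.
df["sumall"] = df.apply(
    lambda y: df.loc[df["ENDDATE"] < y["STARTDATE"], "PROFIT"].sum(),
    axis=1,
)

# Same filter, restricted to rows of the same company as the current row.
df["sumco"] = df.apply(
    lambda y: df.loc[(df["ENDDATE"] < y["STARTDATE"]) & (df["CO"] == y["CO"]),
                     "PROFIT"].sum(),
    axis=1,
)

print(df[["CO", "STARTDATE", "ENDDATE", "PROFIT", "sumall", "sumco"]])
```

On the sample data this gives sumall of 0, -500, 300, 550, 550 and sumco of 0, 0, 0, -500, -500, which agrees with the values described in the question. It scans the whole frame once per row, so it is quadratic in the number of records.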
Answer 0 (score: 1)
My take. There are surely better ways.
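import io
import pandas as pd
import numpy as np
text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT REPORTDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02
GM INDUSTRIAL 250 1000 0 2017-02-01
IBM INDUSTRIAL 600 3000 100 2017-03-01
IBM INDUSTRIAL 600 35000 100 2017-03-02"""
# Parse REPORTDATE as dates (column 0 is CO, not a date), then sort chronologically.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", parse_dates=["REPORTDATE"]).sort_values(by="REPORTDATE")
# Running total of PROFIT over all earlier rows (current row excluded).
df['sumall'] = df['PROFIT'].cumsum() - df['PROFIT']
# Running total of PROFIT per company (current row included).
df['sumco'] = df.groupby('CO')['PROFIT'].cumsum()
# Zero out rows whose per-company total equals their own PROFIT (typically a company's first record).
df['sumco'] = np.where(df['sumco'] == df['PROFIT'], 0, df['sumco'])
print(df[['CO', 'REPORTDATE', 'PROFIT', 'sumall', 'sumco']])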
Output
      CO  REPORTDATE  PROFIT  sumall  sumco
0    IBM  2017-01-01    -500       0      0
1  APPLE  2017-01-02     800    -500      0
2     GM  2017-02-01     250     300      0
3    IBM  2017-03-01     600     550    100
4    IBM  2017-03-02     600    1150    700
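The cumulative sums above depend on a single REPORTDATE ordering, so they do not carry over directly to the STARTDATE/ENDDATE version of the question, where a record counts only if it ended before the current one started. One possible way to handle that case without `apply` is NumPy broadcasting. This is only a sketch: it rebuilds the question's sample data by hand, the names `ended_before` and `same_co` are illustrative, and the n-by-n mask makes it suitable for modestly sized frames only.

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the columns needed here).
df = pd.DataFrame({
    "CO": ["IBM", "APPLE", "GM", "IBM", "IBM"],
    "PROFIT": [-500, 800, 250, 600, 600],
    "STARTDATE": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-02-01", "2017-03-01", "2017-03-02"]),
    "ENDDATE": pd.to_datetime(
        ["2017-01-01", "2017-01-03", "2017-02-03", "2017-03-03", "2017-06-01"]),
})

start = df["STARTDATE"].to_numpy()
end = df["ENDDATE"].to_numpy()
profit = df["PROFIT"].to_numpy()
co = df["CO"].to_numpy()

# ended_before[i, j] is True when record j ended before record i started.
ended_before = end[None, :] < start[:, None]
# same_co[i, j] is True when records i and j belong to the same company.
same_co = co[None, :] == co[:, None]

# Multiply each mask row by the profits and sum across the row.
df["sumall"] = (ended_before * profit).sum(axis=1)
df["sumco"] = ((ended_before & same_co) * profit).sum(axis=1)

print(df[["CO", "STARTDATE", "ENDDATE", "PROFIT", "sumall", "sumco"]])
```

Because the mask compares every record with every other record, the result does not depend on the row order, unlike a cumulative sum.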