Grouping and summing columns with TimeGrouper '1M' messes up my date index — pandas / python

Asked: 2016-03-11 11:51:10

Tags: python pandas

ERROR FOUND: a code snippet is posted in the solution below. The problem with my results came from the data source (FEC.GOV). I tracked it down and am moving forward now. Many thanks to the community for all the time, patience, and help on this one!

Since it was suggested that the solution be applied to the code snippet on the github site, I'm giving the link to the original files here (http://fec.gov/finance/disclosure/ftpdet.shtml#a2011_2012). I'm using 2008 through 2014, data file: pas212.zip, data name: Contributions to Candidates (and other expenditures) from Committees. Again, the code below can be found at https://github.com/Michae108/python-coding.git. Thanks in advance for any help in solving this problem. I've been working on it for three days and it should be a pretty simple task. I import and concatenate four '|' delimited value files, read them in as a pd.DataFrame, and set the date column to datetime. That gives me the following output:

              cmte_id trans_typ entity_typ state  amount     fec_id    cand_id
date                                                                          
2007-08-15  C00112250       24K        ORG    DC    2000  C00431569  P00003392
2007-09-26  C00119040       24K        CCM    FL    1000  C00367680  H2FL05127
2007-09-26  C00119040       24K        CCM    MD    1000  C00140715  H2MD05155
2007-07-20  C00346296       24K        CCM    CA    1000  C00434571  H8CA37137

Second, I want to be able to group the index by a one-month frequency, and then sum [amount] by [trans_typ] and [cand_id].

Here is my code:

import numpy as np
import pandas as pd
import glob

df = pd.concat((pd.read_csv(f, sep='|', header=None, low_memory=False, \
    names=['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', \
    '12', '13', 'date', '15', '16', '17', '18', '19', '20', \
    '21', '22'], index_col=None, dtype={'date':str}) for f in \
    glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True)  
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
df2 = df1[['1', '6', '7', '10', '15', '16', '17']].copy() 
df2.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount',\
               'fec_id','cand_id']

df2['amount'] = df2['amount'].astype(float)

grouper = df2.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])

df = grouper['amount'].sum()
grouper['amount'].sum().unstack().fillna(0)
#print (df.head())

Here is the output from running the code:

    trans_typ   24A     24C     24E     24F     24K     24N     24R     24Z
date    cand_id                                 
1954-07-31  S8AK00090   0   0   0   0   1000    0   0   0
1985-09-30  H8OH18088   0   0   36  0   0   0   0   0
1997-04-30  S6ND00058   0   0   0   0   1000    0   0   0

As you can see, the dates get scrambled after I run the groupby. I'm positive that my dates don't go back further than 2007. I've been trying this simple task of grouping by 1-month periods and then summing [amount] by [trans_typ] and [cand_id]. It seems like it should be straightforward, but I can't find a solution. I've read many questions on Stack Overflow and tried different techniques to solve this. Does anyone have any insight?
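For reference, here is a quick sanity check I can run on the parsed index before grouping (a sketch, assuming the df2 built above; 2007-2014 is just the range I expect):

# inspect the overall date range after parsing
print(df2.index.min(), df2.index.max())

# rows whose dates fall outside the expected 2007-2014 window
suspect = df2[(df2.index < '2007-01-01') | (df2.index > '2014-12-31')]
print(len(suspect))
print(suspect.head())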

Here is a sample of the raw data, in case it helps:

C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||4072320091116608455
C00409409|N|Q2|P|29992447807|24K|CCM|JOHN BOCCIERI FOR CONGRESS|ALLIANCE|OH|44601|||06262009|500|C00435065|H8OH16058|D310244|424490|||4072320091116608452
C00409409|N|Q2|P|29992447807|24K|CCM|MIKE MCMAHON FOR CONGRESS|STATEN ISLAND|NY|10301|||06262009|500|C00451138|H8NY13077|D310245|424490|||4072320091116608453
C00409409|N|Q2|P|29992447808|24K|CCM|MINNICK FOR CONGRESS|BOISE|ID|83701|||06262009|500|C00441105|H8ID01090|D310243|424490|||4072320091116608454
C00409409|N|Q2|P|29992447807|24K|CCM|ADLER FOR CONGRESS|MARLTON|NJ|08053|||06262009|500|C00439067|H8NJ03156|D310247|424490|||4072320091116608451
C00435164|N|Q2|P|29992448007|24K|CCM|ALEXI FOR ILLINOIS EXPLORATORY COMMITTEE||||||06292009|1500|C00459586|S0IL00204|SB21.4124|424495|||4071620091116385529

2 Answers:

Answer 0 (score: 1)

UPDATE:

I think the problem, as @jezrael has already mentioned, is caused by missing dates, together with these two lines:

df.dropna(subset=['17'], inplace=True)  
df.dropna(subset=['date'], inplace=True) 

That's why you may want to first spot the "problematic" rows and then sanitize them (set them to some date that makes sense to you):

import pandas as pd
import glob

def get_headers(fn):
    with open(fn, 'r') as f:
        for line in f:
            if ',' in line:
                return line.strip().split(',')


####################################################
# Data Dictionary - Contributions to Candidates from Committees
# http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml
# http://www.fec.gov/finance/disclosure/metadata/pas2_header_file.csv
#
headers_file = 'pas2_header_file.csv'

interesting_cols = ['CMTE_ID', 'TRANSACTION_TP', 'ENTITY_TP', 'STATE',
                    'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID']

#
# rename columns rules
#
rename_cols = {
  'TRANSACTION_TP':     'trans_typ',
  'TRANSACTION_DT':     'date',
  'TRANSACTION_AMT':    'amount',
  'OTHER_ID':           'fec_id',
}

#
# all columns/headers (already renamed)
#
all_cols = [rename_cols.get(col) if col in rename_cols.keys() else col.lower()
            for col in get_headers(headers_file)]

#
# columns to use in read_csv() (already renamed)
#
cols = [rename_cols.get(col) if col in rename_cols.keys() else col.lower()
        for col in get_headers(headers_file) if col in interesting_cols]


####################################################


df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=cols,
            header=None,
            low_memory=False,
            names=all_cols,
            index_col=None,
            parse_dates=['date'],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
         )
         for f in glob.glob('./itpas2.txt'))
     )

# print rows where 'date' is empty
print(df[pd.isnull(df.date)])

#
# sanitize NaT/empty dates in order to prevent problems with an index in future
#
df.date.fillna(pd.Timestamp('20110101'), inplace=True)

# the rest is your code, almost unchanged ('date' must be the index for TimeGrouper to work):
grouper = df.set_index('date').groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
grouper['amount'].sum().unstack().fillna(0)

Rows with empty dates:

          cmte_id trans_typ entity_tp state date  amount     fec_id    cand_id
52372   C00317446       24K       NaN    CA  NaT    2500  C00409219  H6CA05195
57731   C00416693       24K       IND    DC  NaT    2500  C00463836  H2NM02126
58386   C00152892       24K       NaN    DC  NaT    1000  C00359034  H0MO06073
145715  C00154641       24K       IND    DC  NaT    1000  C00257337  H2CA37023
193651  C00000992       24K       NaN    MI  NaT     500  C00390724  H4MI07103
212982  C00454074       24E       ORG    CA  NaT    1138  S2TX00312  S2TX00312
212983  C00454074       24E       ORG    CA  NaT    4764  S2TX00312  S2TX00312
212984  C00454074       24E       ORG    CA  NaT    7058  S2MO00403  S2MO00403
212985  C00454074       24E       ORG    CA  NaT    5000  S2MO00403  S2MO00403
212986  C00454074       24E       ORG    CA  NaT   50003  S8WI00158  S8WI00158
212987  C00454074       24E       ORG    CA  NaT    8830  S8WI00158  S8WI00158
212988  C00454074       24E       ORG    CA  NaT   22189  S8WI00158  S8WI00158
212989  C00454074       24E       ORG    CA  NaT   11258  S8WI00158  S8WI00158
212990  C00454074       24E       ORG    CA  NaT    5000  S8WI00158  S8WI00158
212991  C00454074       24E       ORG    CA  NaT    7743  S2MO00403  S2MO00403
212992  C00454074       24E       ORG    CA  NaT   12463  S0MI00056  S0MI00056
212993  C00454074       24E       ORG    CA  NaT    2795  S8WI00158  S8WI00158
213034  C00454074       24E       ORG    CA  NaT    6431  S2IN00083  S2IN00083
213035  C00454074       24E       ORG    CA  NaT   28015  S2TX00312  S2TX00312
213036  C00454074       24E       ORG    CA  NaT    5395  S8NE00091  S8NE00091
213037  C00454074       24E       ORG    CA  NaT   19399  S2MO00403  S2MO00403
213038  C00454074       24E       ORG    CA  NaT    2540  S2IN00083  S2IN00083
213039  C00454074       24E       ORG    FL  NaT    1500  S2IN00083  S2IN00083
213040  C00454074       24E       ORG    CA  NaT    8065  S2TX00312  S2TX00312
213041  C00454074       24E       ORG    CA  NaT   11764  S2TX00312  S2TX00312
213042  C00454074       24E       ORG    CA  NaT   61214  S2TX00312  S2TX00312
213043  C00454074       24E       ORG    CA  NaT   44634  S2MO00403  S2MO00403
213044  C00454074       24E       ORG    TN  NaT   15000  S2TX00312  S2TX00312
213045  C00454074       24E       ORG    CA  NaT    5176  S2TX00312  S2TX00312
214642  C90014358       24E       NaN    VA  NaT    2000  S6MT00097  S6MT00097
214643  C90014358       24E       NaN    VA  NaT    2000  H2MT01060  H2MT01060
214644  C90014358       24E       NaN    DC  NaT     139  S6MT00097  S6MT00097
214645  C90014358       24E       NaN    DC  NaT     139  H2MT01060  H2MT01060
214646  C90014358       24E       NaN    DC  NaT     149  S6MT00097  S6MT00097
214647  C90014358       24E       NaN    DC  NaT     149  H2MT01060  H2MT01060
216428  C00023580       24E       ORG    VA  NaT    3352  P80003338  P80003338
216445  C00023580       24E       ORG    VA  NaT     250  P80003338  P80003338
216446  C00023580       24E       ORG    VA  NaT     333  P80003338  P80003338
216447  C00023580       24E       ORG    VA  NaT    2318  P80003338  P80003338
216448  C00023580       24E       ORG    VA  NaT     583  P80003338  P80003338
216449  C00023580       24E       ORG    VA  NaT    2969  P80003338  P80003338
216450  C00023580       24E       ORG    VA  NaT   14011  P80003338  P80003338
216451  C00023580       24E       ORG    VA  NaT     383  P80003338  P80003338
216452  C00023580       24E       ORG    VA  NaT     366  P80003338  P80003338
216453  C00023580       24E       ORG    VA  NaT     984  P80003338  P80003338
216454  C00023580       24E       ORG    VA  NaT     542  P80003338  P80003338
216503  C00023580       24E       ORG    VA  NaT    3077  P80003338  P80003338
216504  C00023580       24E       ORG    VA  NaT    3002  P80003338  P80003338
216505  C00023580       24E       ORG    VA  NaT    5671  P80003338  P80003338
216506  C00023580       24E       ORG    VA  NaT    3853  P80003338  P80003338
231905  C00454074       24E       ORG    CA  NaT   26049  S4WV00084  S4WV00084
231906  C00454074       24E       ORG    NC  NaT  135991  P80003353  P80003353
231907  C00454074       24E       ORG    FL  NaT    5000  P80003353  P80003353
231908  C00454074       24E       ORG    TX  NaT   12500  P80003353  P80003353
231909  C00454074       24A       ORG    TX  NaT   12500  P80003338  P80003338
234844  C00417519       24K       NaN    NY  NaT    2500  C00272633  H2NY26080
281989  C00427203       24K       NaN    DC  NaT     500  C00412304  S6MT00162
309146  C00500785       24A       NaN   NaN  NaT       0  H4FL20023  H4FL20023
310225  C00129189       24K       NaN    MI  NaT    1000  C00347476  H0MI10071

PS I've added a few helper functions/variables (get_headers(), interesting_cols, rename_cols, all_cols, cols) which might help you later on when processing other data/CSV files from fec.gov.
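For example, with the pas2_header_file.csv above, the helpers should yield roughly the following (a sketch; the exact names come from the FEC header file):

headers = get_headers('pas2_header_file.csv')
# e.g. ['CMTE_ID', ..., 'TRANSACTION_TP', 'ENTITY_TP', ..., 'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID', ...]

print(cols)
# the renamed subset passed to usecols=, roughly:
# ['cmte_id', 'trans_typ', 'entity_tp', 'state', 'date', 'amount', 'fec_id', 'cand_id']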

Original answer, based on the sample data:

Code for the provided "cut" sample data set:

#import numpy as np
import pandas as pd
import glob

#dtparser = lambda x: pd.datetime.fromtimestamp(int(x))

cols = ['cmte_id', 'trans_typ', 'entity_typ', 'state',
        'date', 'amount', 'fec_id', 'cand_id']

df = pd.concat(
        (pd.read_csv(
            f,
            sep='|',
            usecols=[0, 5, 6, 9, 13, 14, 15, 16],
            header=None,
            low_memory=False,
            #names=cols,
            index_col=None,
            parse_dates=[13],
            date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
            #dtype={5: np.float64}
         )
         for f in glob.glob('./itpas2**github.txt'))
     )
df.columns = cols
df.trans_typ = df.trans_typ.astype('category')
#print(df.head())
#print(df.dtypes)

a = df.set_index('date').\
        groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ']).\
        agg({'amount': sum}).\
        reset_index()

print(a.pivot_table(index=['date', 'cand_id'],
                    columns='trans_typ',
                    values='amount',
                    fill_value=0,
                    aggfunc='sum').tail(10))

Output:

trans_typ             24A  24C  24E  24F   24K  24N  24R  24Z
date       cand_id
2013-02-28 S0FL00312    0    0    0    0     0    0    0    0
           S0IA00028    0    0    0    0     0    0    0    0
           S0IL00204    0    0    0    0     0    0    0    0
           S2ND00099    0    0    0    0  1000    0    0    0
           S4ME00055    0    0    0    0     0    0    0    0
           S4SC00240    0    0    0    0  5000    0    0    0
           S6MN00267    0    0    0    0     0    0    0    0
           S6NV00028    0    0    0    0  2500    0    0    0
           S6PA00100    0    0    0    0     0    0    0    0
           S8MT00010    0    0    0    0  3500    0    0    0

PS In your file, trans_typ has only the single value 24K, so it can't be pivoted. I therefore tweaked it in the CSV file so that we now have different values.
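You can check how much variety there actually is before pivoting; a quick sketch, assuming the df from the code above:

# distinct transaction types present in the data
print(df['trans_typ'].value_counts())
# if only one type (e.g. 24K) shows up, the pivoted table will have a single column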

Answer 1 (score: 1)

This is quite tricky. The date_parser returned an error, so the date column is first read as a string in read_csv. The date column is then converted with to_datetime and all NaN values are dropped. Finally, you can use groupby and unstack.

import pandas as pd
import glob



#change path by your 
df = pd.concat((pd.read_csv(f, 
                            sep='|', 
                            header=None, 
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ', 'entity_typ', '8', '9', 'state', '11', 'employer', 'occupation', 'date', 'amount', 'fec_id', 'cand_id', '18', '19', '20', '21', '22'], 
                            usecols= ['date', 'cmte_id', 'trans_typ', 'entity_typ', 'state', 'employer', 'occupation', 'amount', 'fec_id', 'cand_id'],
                            dtype={'date': str}
                           ) for f in glob.glob('test/itpas2_data/itpas2**.txt')), ignore_index=True)


#parse column date to datetime
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

#remove rows, where date is NaN
df = df[df['date'].notnull()]

#set column date to index
df = df.set_index('date')

g = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])['amount'].sum()
print g.unstack().fillna(0)

trans_typ                24A  24C   24E  24F     24K  24N  24R  24Z
date       cand_id                                                 
2001-09-30 H2HI02110       0    0     0    0    2500    0    0    0
2007-03-31 S6TN00216       0    0     0    0    2000    0    0    0
2007-10-31 H8IL21021       0    0     0    0   -1000    0    0    0
2008-03-31 S6TN00216       0    0     0    0    1000    0    0    0
2008-07-31 H2PA11098       0    0     0    0    1000    0    0    0
           H4KS03105       0    0     0    0   49664    0    0    0
           H6KS03183       0    0     0    0    1000    0    0    0
2008-10-31 H8KS02090       0    0     0    0    1000    0    0    0
           S6TN00216       0    0     0    0    1500    0    0    0
2008-12-31 H6KS01146       0    0     0    0    2000    0    0    0
2009-02-28 S6OH00163       0    0     0    0   -1000    0    0    0
2009-03-31 S2KY00012       0    0     0    0    2000    0    0    0
           S6WY00068       0    0     0    0   -2500    0    0    0
2009-06-30 S6TN00216       0    0     0    0   -1000    0    0    0
2009-08-31 S0MO00183       0    0     0    0    1000    0    0    0
2009-09-30 S0NY00410       0    0     0    0    1000    0    0    0
2009-10-31 S6OH00163       0    0     0    0   -2500    0    0    0
           S6WY00068       0    0     0    0   -1000    0    0    0
2009-11-30 H8MO09153       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -1000    0    0    0
           S6OH00163       0    0     0    0    -500    0    0    0
2009-12-31 H0MO00019       0    0     0    0     500    0    0    0
           S6TN00216       0    0     0    0   -1000    0    0    0
2010-01-31 H0CT03072       0    0     0    0     250    0    0    0
           S0MA00109       0    0     0    0    5000    0    0    0
2010-02-28 S6TN00216       0    0     0    0   -1500    0    0    0
2010-03-31 H0MO00019       0    0     0    0     500    0    0    0
           S0NY00410       0    0     0    0   -2500    0    0    0
2010-05-31 H0MO06149       0    0     0    0     530    0    0    0
           S6OH00163       0    0     0    0   -1000    0    0    0
...                      ...  ...   ...  ...     ...  ...  ...  ...
2012-12-31 S6UT00063       0    0     0    0    5000    0    0    0
           S6VA00093       0    0     0    0   97250    0    0    0
           S6WY00068       0    0     0    0    1500    0    0    0
           S6WY00126       0    0     0    0   11000    0    0    0
           S8AK00090       0    0     0    0  132350    0    0    0
           S8CO00172       0    0     0    0   88500    0    0    0
           S8DE00079       0    0     0    0    6000    0    0    0
           S8FL00166       0    0     0    0    -932    0    0  651
           S8ID00027       0    0     0    0   13000    0    0  326
           S8ID00092       0    0     0    0    2500    0    0    0
           S8MI00158       0    0     0    0    7500    0    0    0
           S8MI00281     110    0     0    0    3000    0    0    0
           S8MN00438       0    0     0    0   65500    0    0    0
           S8MS00055       0    0     0    0   21500    0    0    0
           S8MS00196       0    0     0    0     500    0    0  650
           S8MT00010       0    0     0    0  185350    0    0    0
           S8NC00239       0    0     0    0   67000    0    0    0
           S8NE00067       0   40     0    0       0    0    0    0
           S8NE00117       0    0     0    0   13000    0    0    0
           S8NJ00392       0    0     0    0   -5000    0    0    0
           S8NM00168       0    0     0    0   -2000    0    0    0
           S8NM00184       0    0     0    0   51000    0    0    0
           S8NY00082       0    0     0    0    1000    0    0    0
           S8OR00207       0    0     0    0   23500    0    0    0
           S8VA00214       0    0   120    0   -2000    0    0    0
           S8WA00194       0    0     0    0   -4500    0    0    0
2013-10-31 P80003338  314379    0     0    0       0    0    0    0
           S8VA00214   14063    0     0    0       0    0    0    0
2013-11-30 H2NJ03183       0    0  2333    0       0    0    0    0
2014-10-31 S6PA00217       0    0     0    0    1500    0    0    0