ERROR FOUND: the code snippet is posted in the solution below. The problem with my results stemmed from the data source (FEC.GOV). I have found it and am now moving on. Many thanks to the community for all the time, patience, help, etc. on this question!
Since it was suggested that solutions work from the code snippets on the github site, I am providing the following link to the original files (http://fec.gov/finance/disclosure/ftpdet.shtml#a2011_2012). I am using 2008 through 2014, data file: pas212.zip, data name: Contributions to Candidates (and other expenditures) from Committees. The code below can also be found at https://github.com/Michae108/python-coding.git. Thanks in advance for any help in solving this problem. I have spent three days on what should be a very simple task: I import and concatenate four "|"-separated-value files, read them as a pd.DataFrame, and set the date column as a datetime index. This gives me the following output:
cmte_id trans_typ entity_typ state amount fec_id cand_id
date
2007-08-15 C00112250 24K ORG DC 2000 C00431569 P00003392
2007-09-26 C00119040 24K CCM FL 1000 C00367680 H2FL05127
2007-09-26 C00119040 24K CCM MD 1000 C00140715 H2MD05155
2007-07-20 C00346296 24K CCM CA 1000 C00434571 H8CA37137
Second, I want to be able to group the index at a one-month frequency, and then sum [amount] by [trans_typ] and [cand_id].
Here is my code:
import numpy as np
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, sep='|', header=None, low_memory=False,
                            names=['1', '2', '3', '4', '5', '6', '7', '8', '9',
                                   '10', '11', '12', '13', 'date', '15', '16',
                                   '17', '18', '19', '20', '21', '22'],
                            index_col=None, dtype={'date': str})
                for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC_data/itpas2_data/itpas2**.txt')))
df.dropna(subset=['17'], inplace=True)
df.dropna(subset=['date'], inplace=True)
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')
df1 = df.set_index('date')
df2 = df1[['1', '6', '7', '10', '15', '16', '17']].copy()
df2.columns = ['cmte_id', 'trans_typ', 'entity_typ', 'state', 'amount',
               'fec_id', 'cand_id']
df2['amount'] = df2['amount'].astype(float)
grouper = df2.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
df = grouper['amount'].sum()
grouper['amount'].sum().unstack().fillna(0)
#print (df.head())
Here is the output from running the code:
trans_typ 24A 24C 24E 24F 24K 24N 24R 24Z
date cand_id
1954-07-31 S8AK00090 0 0 0 0 1000 0 0 0
1985-09-30 H8OH18088 0 0 36 0 0 0 0 0
1997-04-30 S6ND00058 0 0 0 0 1000 0 0 0
As you can see, the dates get scrambled after I run the groupby, and I am certain my dates do not go back beyond 2007. I have tried this simple task of grouping by one-month periods and then aggregating [amount] by [trans_typ] and [cand_id]. It seems like it should be straightforward, but I cannot find a solution. I have read many questions on Stack Overflow and tried different techniques to solve this. Does anyone have any insight?
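For reference, the rows carrying the suspicious dates can be pulled straight out of the df2 frame built above (a minimal diagnostic sketch; the 2007-01-01 cutoff is just my expected lower bound, not anything from the data itself):

# list any rows whose parsed dates fall before the expected 2007 start
suspect = df2[df2.index < pd.Timestamp('2007-01-01')]
print(len(suspect))
print(suspect.head())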
Here is a sample of the raw data, in case it helps:
C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||4072320091116608455
C00409409|N|Q2|P|29992447807|24K|CCM|JOHN BOCCIERI FOR CONGRESS|ALLIANCE|OH|44601|||06262009|500|C00435065|H8OH16058|D310244|424490|||4072320091116608452
C00409409|N|Q2|P|29992447807|24K|CCM|MIKE MCMAHON FOR CONGRESS|STATEN ISLAND|NY|10301|||06262009|500|C00451138|H8NY13077|D310245|424490|||4072320091116608453
C00409409|N|Q2|P|29992447808|24K|CCM|MINNICK FOR CONGRESS|BOISE|ID|83701|||06262009|500|C00441105|H8ID01090|D310243|424490|||4072320091116608454
C00409409|N|Q2|P|29992447807|24K|CCM|ADLER FOR CONGRESS|MARLTON|NJ|08053|||06262009|500|C00439067|H8NJ03156|D310247|424490|||4072320091116608451
C00435164|N|Q2|P|29992448007|24K|CCM|ALEXI FOR ILLINOIS EXPLORATORY COMMITTEE||||||06292009|1500|C00459586|S0IL00204|SB21.4124|424495|||4071620091116385529
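For orientation, splitting one of these raw records on the "|" separator confirms the 0-based field positions that the answers below rely on (field 13 is the MMDDYYYY date, 14 the amount, 15 the FEC id, 16 the candidate id):

# one raw record from the sample above, split into its 22 pipe-delimited fields
record = ('C00409409|N|Q2|P|29992447808|24K|CCM|PERRIELLO FOR CONGRESS|IVY|VA|'
          '22945|||06262009|500|C00438788|H8VA05106|D310246|424490|||'
          '4072320091116608455')
fields = record.split('|')
print(len(fields))    # 22
print(fields[13])     # '06262009'  -> transaction date, MMDDYYYY
print(fields[14])     # '500'       -> transaction amount
print(fields[16])     # 'H8VA05106' -> candidate id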
Answer 0 (score: 1)
UPDATE
I think the problem that @jezrael already mentioned is caused by missing dates and by these two lines:
df.dropna(subset=['17'], inplace=True)
df.dropna(subset=['date'], inplace=True)
That is why you may want to find the "problematic" rows first and then sanitize them (set some date that makes sense for you):
import pandas as pd
import glob

def get_headers(fn):
    with open(fn, 'r') as f:
        for line in f:
            if ',' in line:
                return line.strip().split(',')

####################################################
# Data Dictionary - Contributions to Candidates from Committees
# http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml
# http://www.fec.gov/finance/disclosure/metadata/pas2_header_file.csv
#
headers_file = 'pas2_header_file.csv'

interesting_cols = ['CMTE_ID', 'TRANSACTION_TP', 'ENTITY_TP', 'STATE',
                    'TRANSACTION_DT', 'TRANSACTION_AMT', 'OTHER_ID', 'CAND_ID']

#
# rename-columns rules
#
rename_cols = {
    'TRANSACTION_TP': 'trans_typ',
    'TRANSACTION_DT': 'date',
    'TRANSACTION_AMT': 'amount',
    'OTHER_ID': 'fec_id',
}

#
# all columns/headers (already renamed)
#
all_cols = [rename_cols.get(col) if col in rename_cols.keys() else col.lower()
            for col in get_headers(headers_file)]

#
# columns to use in read_csv() (already renamed)
#
cols = [rename_cols.get(col) if col in rename_cols.keys() else col.lower()
        for col in get_headers(headers_file) if col in interesting_cols]

####################################################

df = pd.concat(
    (pd.read_csv(
        f,
        sep='|',
        usecols=cols,
        header=None,
        low_memory=False,
        names=all_cols,
        index_col=None,
        parse_dates=['date'],
        date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
    )
    for f in glob.glob('./itpas2.txt'))
)

# print rows where 'date' is empty
print(df[pd.isnull(df.date)])

#
# sanitize NaT/empty dates in order to prevent problems with the index later
#
df.date.fillna(pd.Timestamp('20110101'), inplace=True)

# NOTE: TimeGrouper('1M') groups on the index, so set the datetime index first;
# the rest is your code almost unchanged:
df = df.set_index('date')
grouper = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
grouper['amount'].sum().unstack().fillna(0)
Rows with empty dates:
cmte_id trans_typ entity_tp state date amount fec_id cand_id
52372 C00317446 24K NaN CA NaT 2500 C00409219 H6CA05195
57731 C00416693 24K IND DC NaT 2500 C00463836 H2NM02126
58386 C00152892 24K NaN DC NaT 1000 C00359034 H0MO06073
145715 C00154641 24K IND DC NaT 1000 C00257337 H2CA37023
193651 C00000992 24K NaN MI NaT 500 C00390724 H4MI07103
212982 C00454074 24E ORG CA NaT 1138 S2TX00312 S2TX00312
212983 C00454074 24E ORG CA NaT 4764 S2TX00312 S2TX00312
212984 C00454074 24E ORG CA NaT 7058 S2MO00403 S2MO00403
212985 C00454074 24E ORG CA NaT 5000 S2MO00403 S2MO00403
212986 C00454074 24E ORG CA NaT 50003 S8WI00158 S8WI00158
212987 C00454074 24E ORG CA NaT 8830 S8WI00158 S8WI00158
212988 C00454074 24E ORG CA NaT 22189 S8WI00158 S8WI00158
212989 C00454074 24E ORG CA NaT 11258 S8WI00158 S8WI00158
212990 C00454074 24E ORG CA NaT 5000 S8WI00158 S8WI00158
212991 C00454074 24E ORG CA NaT 7743 S2MO00403 S2MO00403
212992 C00454074 24E ORG CA NaT 12463 S0MI00056 S0MI00056
212993 C00454074 24E ORG CA NaT 2795 S8WI00158 S8WI00158
213034 C00454074 24E ORG CA NaT 6431 S2IN00083 S2IN00083
213035 C00454074 24E ORG CA NaT 28015 S2TX00312 S2TX00312
213036 C00454074 24E ORG CA NaT 5395 S8NE00091 S8NE00091
213037 C00454074 24E ORG CA NaT 19399 S2MO00403 S2MO00403
213038 C00454074 24E ORG CA NaT 2540 S2IN00083 S2IN00083
213039 C00454074 24E ORG FL NaT 1500 S2IN00083 S2IN00083
213040 C00454074 24E ORG CA NaT 8065 S2TX00312 S2TX00312
213041 C00454074 24E ORG CA NaT 11764 S2TX00312 S2TX00312
213042 C00454074 24E ORG CA NaT 61214 S2TX00312 S2TX00312
213043 C00454074 24E ORG CA NaT 44634 S2MO00403 S2MO00403
213044 C00454074 24E ORG TN NaT 15000 S2TX00312 S2TX00312
213045 C00454074 24E ORG CA NaT 5176 S2TX00312 S2TX00312
214642 C90014358 24E NaN VA NaT 2000 S6MT00097 S6MT00097
214643 C90014358 24E NaN VA NaT 2000 H2MT01060 H2MT01060
214644 C90014358 24E NaN DC NaT 139 S6MT00097 S6MT00097
214645 C90014358 24E NaN DC NaT 139 H2MT01060 H2MT01060
214646 C90014358 24E NaN DC NaT 149 S6MT00097 S6MT00097
214647 C90014358 24E NaN DC NaT 149 H2MT01060 H2MT01060
216428 C00023580 24E ORG VA NaT 3352 P80003338 P80003338
216445 C00023580 24E ORG VA NaT 250 P80003338 P80003338
216446 C00023580 24E ORG VA NaT 333 P80003338 P80003338
216447 C00023580 24E ORG VA NaT 2318 P80003338 P80003338
216448 C00023580 24E ORG VA NaT 583 P80003338 P80003338
216449 C00023580 24E ORG VA NaT 2969 P80003338 P80003338
216450 C00023580 24E ORG VA NaT 14011 P80003338 P80003338
216451 C00023580 24E ORG VA NaT 383 P80003338 P80003338
216452 C00023580 24E ORG VA NaT 366 P80003338 P80003338
216453 C00023580 24E ORG VA NaT 984 P80003338 P80003338
216454 C00023580 24E ORG VA NaT 542 P80003338 P80003338
216503 C00023580 24E ORG VA NaT 3077 P80003338 P80003338
216504 C00023580 24E ORG VA NaT 3002 P80003338 P80003338
216505 C00023580 24E ORG VA NaT 5671 P80003338 P80003338
216506 C00023580 24E ORG VA NaT 3853 P80003338 P80003338
231905 C00454074 24E ORG CA NaT 26049 S4WV00084 S4WV00084
231906 C00454074 24E ORG NC NaT 135991 P80003353 P80003353
231907 C00454074 24E ORG FL NaT 5000 P80003353 P80003353
231908 C00454074 24E ORG TX NaT 12500 P80003353 P80003353
231909 C00454074 24A ORG TX NaT 12500 P80003338 P80003338
234844 C00417519 24K NaN NY NaT 2500 C00272633 H2NY26080
281989 C00427203 24K NaN DC NaT 500 C00412304 S6MT00162
309146 C00500785 24A NaN NaN NaT 0 H4FL20023 H4FL20023
310225 C00129189 24K NaN MI NaT 1000 C00347476 H0MI10071
PS: I have added some helper functions/variables (get_headers(), interesting_cols, rename_cols, all_cols, cols) that may help you later when processing different data/CSV files from fec.gov.
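For example, pointing get_headers() at another table's header file (the filename below is hypothetical) rebuilds the column lists; dict.get with a default is an equivalent shorthand for the conditional used above:

# 'indiv_header_file.csv' is a hypothetical header file for another FEC table
other_headers = get_headers('indiv_header_file.csv')
other_cols = [rename_cols.get(col, col.lower()) for col in other_headers]
print(other_cols)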
Original answer, based on the sample data
指定&#34; cut&#34;的代码样本数据集:
#import numpy as np
import pandas as pd
import glob

#dtparser = lambda x: pd.datetime.fromtimestamp(int(x))

cols = ['cmte_id', 'trans_typ', 'entity_typ', 'state',
        'date', 'amount', 'fec_id', 'cand_id']

df = pd.concat(
    (pd.read_csv(
        f,
        sep='|',
        usecols=[0, 5, 6, 9, 13, 14, 15, 16],
        header=None,
        low_memory=False,
        #names=cols,
        index_col=None,
        parse_dates=[13],
        date_parser=lambda x: pd.to_datetime(x, format='%m%d%Y'),
        #dtype={5: np.float64}
    )
    for f in glob.glob('./itpas2**github.txt'))
)

df.columns = cols
df.trans_typ = df.trans_typ.astype('category')

#print(df.head())
#print(df.dtypes)

a = (df.set_index('date')
       .groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])
       .agg({'amount': sum})
       .reset_index())

print(a.pivot_table(index=['date', 'cand_id'],
                    columns='trans_typ',
                    values='amount',
                    fill_value=0,
                    aggfunc='sum').tail(10))
Output:
trans_typ 24A 24C 24E 24F 24K 24N 24R 24Z
date cand_id
2013-02-28 S0FL00312 0 0 0 0 0 0 0 0
S0IA00028 0 0 0 0 0 0 0 0
S0IL00204 0 0 0 0 0 0 0 0
S2ND00099 0 0 0 0 1000 0 0 0
S4ME00055 0 0 0 0 0 0 0 0
S4SC00240 0 0 0 0 5000 0 0 0
S6MN00267 0 0 0 0 0 0 0 0
S6NV00028 0 0 0 0 2500 0 0 0
S6PA00100 0 0 0 0 0 0 0 0
S8MT00010 0 0 0 0 3500 0 0 0
PS: In your file trans_typ had only a single value, 24K, so it could not be pivoted. I therefore manipulated it in the CSV file so that we now have different values.
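The effect is easy to reproduce with a toy frame (a minimal sketch): with a single trans_typ value the pivot collapses to one column, so there is nothing to spread across the columns axis:

import pandas as pd

toy = pd.DataFrame({'cand_id': ['A', 'A', 'B'],
                    'trans_typ': ['24K', '24K', '24K'],   # only one value
                    'amount': [100, 200, 300]})
# the pivot produces a single '24K' column
print(toy.pivot_table(index='cand_id', columns='trans_typ',
                      values='amount', aggfunc='sum', fill_value=0))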
Answer 1 (score: 1)
This is rather complicated. date_parser returns an error, so the date column is first read as a string in read_csv. The date column is then converted with to_datetime, and all NaN values are removed. Finally, you can use groupby and unstack:
import pandas as pd
import glob

# change the path to yours
df = pd.concat((pd.read_csv(f,
                            sep='|',
                            header=None,
                            names=['cmte_id', '2', '3', '4', '5', 'trans_typ',
                                   'entity_typ', '8', '9', 'state', '11',
                                   'employer', 'occupation', 'date', 'amount',
                                   'fec_id', 'cand_id', '18', '19', '20',
                                   '21', '22'],
                            usecols=['date', 'cmte_id', 'trans_typ',
                                     'entity_typ', 'state', 'employer',
                                     'occupation', 'amount', 'fec_id',
                                     'cand_id'],
                            dtype={'date': str}
                            ) for f in glob.glob('test/itpas2_data/itpas2**.txt')),
               ignore_index=True)

# parse column date to datetime
df['date'] = pd.to_datetime(df['date'], format='%m%d%Y')

# remove rows where date is NaN
df = df[df['date'].notnull()]

# set column date as index
df = df.set_index('date')

g = df.groupby([pd.TimeGrouper('1M'), 'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))
trans_typ 24A 24C 24E 24F 24K 24N 24R 24Z
date cand_id
2001-09-30 H2HI02110 0 0 0 0 2500 0 0 0
2007-03-31 S6TN00216 0 0 0 0 2000 0 0 0
2007-10-31 H8IL21021 0 0 0 0 -1000 0 0 0
2008-03-31 S6TN00216 0 0 0 0 1000 0 0 0
2008-07-31 H2PA11098 0 0 0 0 1000 0 0 0
H4KS03105 0 0 0 0 49664 0 0 0
H6KS03183 0 0 0 0 1000 0 0 0
2008-10-31 H8KS02090 0 0 0 0 1000 0 0 0
S6TN00216 0 0 0 0 1500 0 0 0
2008-12-31 H6KS01146 0 0 0 0 2000 0 0 0
2009-02-28 S6OH00163 0 0 0 0 -1000 0 0 0
2009-03-31 S2KY00012 0 0 0 0 2000 0 0 0
S6WY00068 0 0 0 0 -2500 0 0 0
2009-06-30 S6TN00216 0 0 0 0 -1000 0 0 0
2009-08-31 S0MO00183 0 0 0 0 1000 0 0 0
2009-09-30 S0NY00410 0 0 0 0 1000 0 0 0
2009-10-31 S6OH00163 0 0 0 0 -2500 0 0 0
S6WY00068 0 0 0 0 -1000 0 0 0
2009-11-30 H8MO09153 0 0 0 0 500 0 0 0
S0NY00410 0 0 0 0 -1000 0 0 0
S6OH00163 0 0 0 0 -500 0 0 0
2009-12-31 H0MO00019 0 0 0 0 500 0 0 0
S6TN00216 0 0 0 0 -1000 0 0 0
2010-01-31 H0CT03072 0 0 0 0 250 0 0 0
S0MA00109 0 0 0 0 5000 0 0 0
2010-02-28 S6TN00216 0 0 0 0 -1500 0 0 0
2010-03-31 H0MO00019 0 0 0 0 500 0 0 0
S0NY00410 0 0 0 0 -2500 0 0 0
2010-05-31 H0MO06149 0 0 0 0 530 0 0 0
S6OH00163 0 0 0 0 -1000 0 0 0
... ... ... ... ... ... ... ... ...
2012-12-31 S6UT00063 0 0 0 0 5000 0 0 0
S6VA00093 0 0 0 0 97250 0 0 0
S6WY00068 0 0 0 0 1500 0 0 0
S6WY00126 0 0 0 0 11000 0 0 0
S8AK00090 0 0 0 0 132350 0 0 0
S8CO00172 0 0 0 0 88500 0 0 0
S8DE00079 0 0 0 0 6000 0 0 0
S8FL00166 0 0 0 0 -932 0 0 651
S8ID00027 0 0 0 0 13000 0 0 326
S8ID00092 0 0 0 0 2500 0 0 0
S8MI00158 0 0 0 0 7500 0 0 0
S8MI00281 110 0 0 0 3000 0 0 0
S8MN00438 0 0 0 0 65500 0 0 0
S8MS00055 0 0 0 0 21500 0 0 0
S8MS00196 0 0 0 0 500 0 0 650
S8MT00010 0 0 0 0 185350 0 0 0
S8NC00239 0 0 0 0 67000 0 0 0
S8NE00067 0 40 0 0 0 0 0 0
S8NE00117 0 0 0 0 13000 0 0 0
S8NJ00392 0 0 0 0 -5000 0 0 0
S8NM00168 0 0 0 0 -2000 0 0 0
S8NM00184 0 0 0 0 51000 0 0 0
S8NY00082 0 0 0 0 1000 0 0 0
S8OR00207 0 0 0 0 23500 0 0 0
S8VA00214 0 0 120 0 -2000 0 0 0
S8WA00194 0 0 0 0 -4500 0 0 0
2013-10-31 P80003338 314379 0 0 0 0 0 0 0
S8VA00214 14063 0 0 0 0 0 0 0
2013-11-30 H2NJ03183 0 0 2333 0 0 0 0 0
2014-10-31 S6PA00217 0 0 0 0 1500 0 0 0
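One closing note: pd.TimeGrouper was deprecated in later pandas releases (0.21+) and eventually removed in favor of pd.Grouper, so on a current pandas the grouping step would read roughly as follows (a sketch; reset_index() restores date as a regular column so key='date' can pick it up):

# pandas >= 0.21: pd.Grouper replaces the deprecated pd.TimeGrouper;
# key='date' groups on a regular column instead of the index
g = df.reset_index().groupby([pd.Grouper(freq='M', key='date'),
                              'cand_id', 'trans_typ'])['amount'].sum()
print(g.unstack().fillna(0))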