FEC Data Scraping - How to handle the .do extension?

Time: 2017-03-04 01:09:53

Tags: r web-scraping

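I am looking to automate the process of downloading data from the FEC, but I am still an amateur at advanced data scraping.

What I would like the script to do:

Download the detailed individual and other-committee contribution CSV files for specific candidate PACs. The FEC does not keep the data on a static page. Instead, it is served through some funky Javascript (I think) data extension. The URL is always the same for any given PAC:

http://www.fec.gov/fecviewer/CandidateCommitteeDetail.do

That link takes you to a search box rather than a static page, and I don't know how to get around that in code.

I am not entirely sure how to reach the page I need, nor how to download the CSV file to a location of my choice. Any help would be greatly appreciated.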

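3 Answers:

Answer 0 (score: 2)

This should give you a starting point. If you are going to scrape, learning how to use your browser's Developer Tools is a good investment of time.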

library(httr)
library(jsonlite)
library(tidyverse)

POST(url = "http://www.fec.gov/fecviewer/ExportImageSearchResults.do",
     body = list(format = "json",
                 candCmteIdName = "",
                 state = "ME",
                 district = "",
                 city = "",
                 treasurerName = "",
                 reportYear = "",
                 covStartDate = "",
                 covEndDate = "",
                 defaultTab = "1"),
     encode = "form") -> res

res_j <- fromJSON(content(res, as="text"))
map_df(res_j$fec.gov$results, flatten_df) %>%
  glimpse()
## Observations: 343
## Variables: 9
## $ ID                              <chr> "S4ME00071", "S4ME00089", "H6M...
## $ Name                            <chr> "BELLOWS, SHENNA", "BENNETT, E...
## $ Treasurer Name                  <chr> "null", "null", "null", "null"...
## $ Active Through                  <chr> "2018", "2018", "2018", "2018"...
## $ City                            <chr> "MANCHESTER", "PORTLAND", "BRU...
## $ State                           <chr> "ME", "ME", "ME", "ME", "ME", ...
## $ Party                           <chr> "DEMOCRATIC PARTY", "REPUBLICA...
## $ Committee Type/Candidate Office <chr> "S - Senate", "S - Senate", "H...
## $ Committee Designation           <chr> "null", "null", "null", "null"...

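If you inspect the page's traffic more closely on the "Network" tab of Developer Tools, you will see other calls to resources of this kind, for example: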

POST(url = "http://www.fec.gov/fecviewer/ExportCandidateCommitteeCurrentSummary.do",
     body = list(format = "json",
                 electionYr = "2016",
                 tabIndex = "1",
                 candidateCommitteeId = "S4ME00071",
                 conCandidateCommitteId = "C00550434",
                 conCandidateCommitteeName = "BELLOWS+FOR+SENATE",
                 lineNumber = "",
                 lineDescription = "",
                 commingFrom = "twoYearSummary",
                 comingFromCashExpSummary = "false",
                 electionYrOpt = "2016"),
     encode = "form")

GET(url = "http://www.fec.gov/fecviewer/CommitteeDetailCurrentSummary.do",
    query=list(tabIndex=1,
               candidateCommitteeId="H6ME02130",
               electionYr=2016))
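One thing to watch when copying parameters out of the Network tab: the values shown there may already be URL-encoded (note the + signs in BELLOWS+FOR+SENATE), and a form encoder will escape them again. A small Python sketch of the effect with urllib.parse; whether httr and the FEC server behave the same way is an assumption you should check against what the browser actually sends.

```python
from urllib.parse import urlencode

# Plain committee name: the encoder turns spaces into "+".
plain = urlencode({"conCandidateCommitteeName": "BELLOWS FOR SENATE"})

# Name copied in already-encoded form: the "+" signs get escaped again.
pre_encoded = urlencode({"conCandidateCommitteeName": "BELLOWS+FOR+SENATE"})

print(plain)        # conCandidateCommitteeName=BELLOWS+FOR+SENATE
print(pre_encoded)  # conCandidateCommitteeName=BELLOWS%2BFOR%2BSENATE

# The query string for the GET call is built the same way.
query = urlencode({
    "tabIndex": 1,
    "candidateCommitteeId": "H6ME02130",
    "electionYr": 2016,
})
detail_url = "http://www.fec.gov/fecviewer/CommitteeDetailCurrentSummary.do?" + query
```

If the server returns nothing for a request built from copied values, passing the decoded text (spaces instead of +) is a reasonable first thing to try.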

In addition, you can download the data files in bulk: http://www.fec.gov/finance/disclosure/ftpdet.shtml
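Since the question also asks how to save a download to a location of your choice, here is a minimal Python sketch for fetching a zipped bulk file and unpacking it into a chosen directory. The function names are made up for illustration, and the exact download URL is deliberately left for you to copy from the bulk data page. The demo at the bottom builds a tiny archive in memory so it runs without touching the network.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlopen


def extract_zip_bytes(data, dest_dir):
    """Unpack an in-memory zip archive into dest_dir; return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]


def download_and_extract(url, dest_dir):
    """Download a zip file from url (read fully into memory) and unpack it."""
    with urlopen(url) as response:
        return extract_zip_bytes(response.read(), dest_dir)


# Offline demo: a one-file archive with a made-up pipe-delimited row.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("itcont.txt", "C00000001|N|EXAMPLE\n")
demo_dir = tempfile.mkdtemp()
extracted = extract_zip_bytes(buffer.getvalue(), demo_dir)
```

Reading the whole response into memory keeps the sketch short; for the larger bulk files you would probably want to stream the download to disk first and open the zip from there.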

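Answer 1 (score: 0)

.do is a Struts (https://struts.apache.org/) extension: the page is generated server-side on every call, based on visible and hidden servlet parameters. I am not sure there is a systematic way to parse/scrape it.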
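One way to see which visible and hidden parameters a page would submit is to pull the input fields out of its HTML. A rough Python sketch using only the standard library follows; the class name and the sample form are hypothetical (the field names are borrowed from the requests shown in the other answer), and real FEC pages may build their parameters in Javascript instead.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements, split by visibility."""

    def __init__(self):
        super().__init__()
        self.hidden = {}
        self.visible = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if not name:
            return
        # An input with no type attribute defaults to a text field.
        input_type = (attributes.get("type") or "text").lower()
        target = self.hidden if input_type == "hidden" else self.visible
        target[name] = attributes.get("value") or ""


# Hypothetical form markup, used only to exercise the parser offline.
SAMPLE_HTML = (
    '<form action="CandidateCommitteeDetail.do" method="post">'
    '<input type="hidden" name="tabIndex" value="1">'
    '<input type="hidden" name="electionYr" value="2016"/>'
    '<input type="text" name="candidateCommitteeId" value="">'
    "</form>"
)

collector = FormFieldCollector()
collector.feed(SAMPLE_HTML)
print(collector.hidden)   # {'tabIndex': '1', 'electionYr': '2016'}
print(collector.visible)  # {'candidateCommitteeId': ''}
```

The collected dictionaries can then be merged, edited, and sent back as the form body of a POST.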

Answer 2 (score: 0)

I have posted instructions on how to do the download and conversion here:

https://github.com/AaronNHorvitz/Federal-Election-Commission-FEC-Data/blob/master/.gitignore/Convert_Contribution_By_Individuals_2016.py

The 2016 individual contributions file and its header file are available from the FEC at this link:

https://classic.fec.gov/finance/disclosure/ftpdet.shtml

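However, the main data file is not an ordinary CSV: it is a pipe-delimited text file (the code below reads it with sep='|'), and the column names live in a separate header file, which is in CSV format.

For example: if you download the 2016 individual contributions data and the header file separately and unzip them, you can combine them with the following code: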

import pandas as pd

# Establish the file paths (adjust these to your own download location)

# Column headers:
header_filepath = 'C:/Election Year 2016/indiv_header_file.csv'

# Individual contributions:
contributions_filepath = 'C:/Election Year 2016/indiv16/itcont.txt'

# Converted and joined CSV file:
contributions_csv_filepath = 'C:/Election Year 2016/individual_contributions_2016.csv'

# Read the header file; its first row holds the column names
header_df = pd.read_csv(header_filepath)

# Read the pipe-delimited TXT file into a dataframe, labelling the columns
# with the names from the header file
contributions_df = pd.read_csv(contributions_filepath,
                               encoding="ISO-8859-1",
                               sep='|',
                               names=header_df.columns,
                               index_col=False)

# Write the converted file to CSV format
contributions_df.to_csv(contributions_csv_filepath)

# Display some of the contents of the dataframe
contributions_df.head(100)
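The individual contributions file can be large, so loading it all into one dataframe may not fit in memory. As an alternative, here is a minimal sketch that streams the file row by row with Python's csv module. The function name is made up for illustration, and it assumes the same layout as above: a one-line CSV header file and a pipe-delimited data file with no header row.

```python
import csv


def convert_pipe_to_csv(header_path, data_path, out_path, encoding="ISO-8859-1"):
    """Stream a pipe-delimited file to CSV, prepending the column names.

    Returns the number of data rows written.
    """
    # The first row of the header file holds the column names.
    with open(header_path, newline="", encoding=encoding) as header_file:
        column_names = next(csv.reader(header_file))

    rows_written = 0
    with open(data_path, newline="", encoding=encoding) as source, \
         open(out_path, "w", newline="", encoding="utf-8") as target:
        # QUOTE_NONE treats stray quote characters in the data as plain text.
        reader = csv.reader(source, delimiter="|", quoting=csv.QUOTE_NONE)
        writer = csv.writer(target)
        writer.writerow(column_names)
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Called with the three file paths from the code above, this writes the same kind of combined CSV (without the extra index column that to_csv adds) while holding only one row in memory at a time.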