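I'm using Selenium to scrape a dynamic JavaScript table of federal employee positions and salaries from http://www.fedsdatacenter.com/federal-pay-rates/index.php?n=&l=&a=SECURITIES+AND+EXCHANGE+COMMISSION&o=&y=all. (Note: this is all public-domain data, so there is no concern about personal information.)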
I'm trying to get it into a pandas DataFrame for analysis. My problem is that the data I get from Selenium is a list that prints as:
[[u'DOE,JON'], [u'14'], [u'SK'], [u'$176,571.00'], [u'$2,000.00'], [u'SECURITIES AND EXCHANGE COMMISSION'], [u'WASHINGTON'], [u'GENERAL ATTORNEY'], [u'2012']], ...
What I want is a DataFrame that handles an arbitrary number of records, like this:
NAME GRADE SCALE SALARY BONUS AGENCY LOCATION POSITION YEAR
Doe, Jon 14 SK $176,571.00 $2,000.00 SEC DC ATTY 2012
.
.
.
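I've tried converting this list to a dictionary, using the zip() function with the column names as a tuple and the data as a list, and so on, without success, although it has been a great tour of Python features. What should the next step be after getting the data, or should I be reading the data in differently?
Currently, the scraper code is: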
from selenium import webdriver
path_to_chromedriver = '/Users/xxx/Documents/webdriver/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
url = 'http://www.fedsdatacenter.com/federal-pay-rates/index.php'
browser.get(url)
inputAgency = browser.find_element_by_id('a')
inputYear = browser.find_element_by_id('y')
# Send data
inputAgency.send_keys('SECURITIES AND EXCHANGE COMMISSION')
inputYear.send_keys('All')
# Select 'All' from Years element
browser.find_element_by_css_selector('input[type=\"submit\"]').click()
browser.find_element_by_xpath('//*[@id="example_length"]/label/select/option[4]').click()
SMRtable = browser.find_element_by_id('example')
scrapedData = []
for td in SMRtable.find_elements_by_xpath('.//td'):
    # each table cell is stored as its own single-item list
    scrapedData.append([td.get_attribute('innerHTML')])
    print(td.get_attribute('innerHTML'))
Answer 0 (score: 1)
You can do this with pandas alone.
First, open the page's source (View Page Source) and look at lines 14807 - 14826:
// data table initialization
$(document).ready(function() {
    $('#example').dataTable( {
        "bPaginate": true,
        "bFilter": false,
        "bProcessing": true,
        "bServerSide": true,
        "aoColumns": [
            null,
            null,
            null,
            { "sType": 'currency' }, // set currency columns to allow sorting
            { "sType": 'currency' }, // set second column to currency to allow sorting
            null,
            null,
            null,
            null
        ],
        "sAjaxSource": "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all"
    } );
} );
This means the page uses DataTables, and the data is loaded as JSON from an Ajax source.
So instead of scraping the HTML, you can get clean JSON from:
output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all
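The final link (with each space replaced by %20) is:
http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all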
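The %20 substitution can be reproduced with the standard library rather than by hand. This is a minimal sketch assuming Python 3 (where the function lives in urllib.parse); the names BASE, relative and final_url are only illustrative, and the relative path is the sAjaxSource value from the page source.

```python
from urllib.parse import quote

# Directory of the page that issues the Ajax request
BASE = "http://www.fedsdatacenter.com/federal-pay-rates/"

# Value of "sAjaxSource" copied from the page source (contains raw spaces)
relative = "output.php?n=&a=SECURITIES AND EXCHANGE COMMISSION&l=&o=&y=all"

# Leave the URL delimiters ?, = and & alone; percent-encode the spaces
final_url = BASE + quote(relative, safe="?=&")
print(final_url)
```

Passing the delimiters through safe keeps the query string structure intact, so only the spaces inside the agency name are rewritten.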
JSON:
{"sEcho":0,"iTotalRecords":"7072900","iTotalDisplayRecords":"19919","aaData":[
["ZUVER,SHAHEEN H","14","SK","$170,960.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUR,MIA C.","14","SK","$164,875.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUNDEL,JENNET LEONG","14","SK","$204,638.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","SAN FRANCISCO","ACCOUNTING","2014"],
["ZUKOWSKI,DAVID W","04","SK","$38,382.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","BOSTON","ADMIN AND OFFICE SUPPORT STUDENT TRAINEE","2014"],
...
So you can parse this JSON with pandas' read_json:
import pandas as pd
df = pd.read_json("http://www.fedsdatacenter.com/federal-pay-rates/output.php?n=&a=SECURITIES%20AND%20EXCHANGE%20COMMISSION&l=&o=&y=all")
print(df.head())
aaData iTotalDisplayRecords \
0 [ZUVER,SHAHEEN H, 14, SK, $170,960.00, $0.00, ... 19919
1 [ZUR,MIA C., 14, SK, $164,875.00, $0.00, SECUR... 19919
2 [ZUNDEL,JENNET LEONG, 14, SK, $204,638.00, $0.... 19919
3 [ZUKOWSKI,DAVID W, 04, SK, $38,382.00, $0.00, ... 19919
4 [ZOU,FAN, 14, SK, $166,650.00, $0.00, SECURITI... 19919
iTotalRecords sEcho
0 7072900 0
1 7072900 0
2 7072900 0
3 7072900 0
4 7072900 0
Then build a new DataFrame from the aaData column, using a list comprehension:
df1 = pd.DataFrame([ x for x in df['aaData'] ])
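To see why this works, the step can be tried offline on a small sample. The sketch below assumes a payload shaped like the JSON shown earlier, trimmed to two rows copied from it; sample, rows and df_sample are illustrative names, and json.loads stands in for the network request.

```python
import json

import pandas as pd

# Two rows copied from the JSON sample; the real response holds many more
sample = '''{"sEcho":0,"iTotalRecords":"7072900","iTotalDisplayRecords":"19919","aaData":[
["ZUVER,SHAHEEN H","14","SK","$170,960.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"],
["ZUR,MIA C.","14","SK","$164,875.00","$0.00","SECURITIES AND EXCHANGE COMMISSION","WASHINGTON","GENERAL ATTORNEY","2014"]
]}'''

payload = json.loads(sample)
rows = payload["aaData"]        # a list of 9-item lists, one per employee

# Each inner list becomes one row, each position one (unnamed) column
df_sample = pd.DataFrame(rows)
print(df_sample.shape)
```

Because every inner list has the same nine items, the DataFrame constructor lays them out as nine integer-labelled columns, which is why the names still have to be assigned afterwards.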
Set the column names:
df1.columns = ['NAME','GRADE','SCALE','SALARY','BONUS','AGENCY','LOCATION','POSITION','YEAR']
print(df1.head())
NAME GRADE SCALE SALARY BONUS \
0 ZUVER,SHAHEEN H 14 SK $170,960.00 $0.00
1 ZUR,MIA C. 14 SK $164,875.00 $0.00
2 ZUNDEL,JENNET LEONG 14 SK $204,638.00 $0.00
3 ZUKOWSKI,DAVID W 04 SK $38,382.00 $0.00
4 ZOU,FAN 14 SK $166,650.00 $0.00
AGENCY LOCATION \
0 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
1 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
2 SECURITIES AND EXCHANGE COMMISSION SAN FRANCISCO
3 SECURITIES AND EXCHANGE COMMISSION BOSTON
4 SECURITIES AND EXCHANGE COMMISSION WASHINGTON
POSITION YEAR
0 GENERAL ATTORNEY 2014
1 GENERAL ATTORNEY 2014
2 ACCOUNTING 2014
3 ADMIN AND OFFICE SUPPORT STUDENT TRAINEE 2014
4 INFORMATION TECHNOLOGY MANAGEMENT 2014
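If you would rather keep the Selenium scraper, the flat list it produces can also be reshaped directly. What follows is only a sketch under a few assumptions: every table row contributes exactly nine cells in the column order above, scraped is a stand-in for scrapedData built from the example record in the question, and cells_to_frame and money_to_float are hypothetical helper names. It also converts SALARY and BONUS to numbers, which analysis will usually need.

```python
import pandas as pd

COLUMNS = ['NAME', 'GRADE', 'SCALE', 'SALARY', 'BONUS',
           'AGENCY', 'LOCATION', 'POSITION', 'YEAR']

# Stand-in for the scraper's output: one single-item list per table cell
scraped = [['DOE,JON'], ['14'], ['SK'], ['$176,571.00'], ['$2,000.00'],
           ['SECURITIES AND EXCHANGE COMMISSION'], ['WASHINGTON'],
           ['GENERAL ATTORNEY'], ['2012']]


def cells_to_frame(cells, columns=COLUMNS):
    """Group a flat list of single-item cells into rows of len(columns)."""
    flat = [cell[0] for cell in cells]
    width = len(columns)
    if len(flat) % width:
        raise ValueError("cell count is not a multiple of the column count")
    rows = [flat[i:i + width] for i in range(0, len(flat), width)]
    return pd.DataFrame(rows, columns=columns)


def money_to_float(series):
    """Turn strings like '$176,571.00' into floats."""
    return series.str.replace(r'[$,]', '', regex=True).astype(float)


df_scraped = cells_to_frame(scraped)
for col in ('SALARY', 'BONUS'):
    df_scraped[col] = money_to_float(df_scraped[col])
print(df_scraped[['NAME', 'SALARY', 'BONUS']])
```

The same money_to_float step should apply unchanged to df1 from the JSON route, since its SALARY and BONUS columns hold strings in the same format.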