Python: Using XPath to get data from a table

Date: 2016-03-28 02:11:38

Tags: python xpath

I am trying to get the data from the table at the bottom of http://projects.fivethirtyeight.com/election-2016/delegate-targets/.

import pandas
import requests
from lxml import html

url = "http://projects.fivethirtyeight.com/election-2016/delegate-targets/"
response = requests.get(url)
doc = html.fromstring(response.text)

# Select the desktop version of the delegates table.
tables = doc.findall('.//table[@class="delegates desktop"]')
election = tables[0]
election_rows = election.findall('.//tr')


def extractCells(row, isHeader=False):
    # Header rows use <th> cells, data rows use <td> cells.
    if isHeader:
        cells = row.findall('.//th')
    else:
        cells = row.findall('.//td')
    return [val.text_content() for val in cells]


def parse_options_data(table):
    rows = table.findall(".//tr")
    # rows[1] is the second header row (tr class="bottom").
    header = extractCells(rows[1], isHeader=True)
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)


election_data = parse_options_data(election)
election_data

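I am having trouble with the top row, which holds the candidate names ('Trump', 'Cruz', 'Kasich'). It sits under tr class="top", and right now I only get tr class="bottom" (the row that starts with "won / target").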

Any help is much appreciated!

2 Answers:

Answer 0: (score: 0)

The candidate names are in row 0:

candidates = [val.text_content() for val in rows[0].findall('.//th')[1:]]

Or, reusing the same extractCells() function:

candidates = extractCells(rows[0], isHeader=True)[1:]

The [1:] slice here skips the first th cell, which is empty.
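
To make the row indexing concrete, here is a minimal sketch that runs the same extraction against a small inline table instead of the live page. The HTML below is made up for illustration (including the colspan, the column labels, and the sample data row); the real page's markup may differ.

```python
from lxml import html

# Made-up miniature of the table; only the class names come from the question.
SAMPLE = """
<div>
  <table class="delegates desktop">
    <tr class="top">
      <th colspan="3"></th><th>Trump</th><th>Cruz</th><th>Kasich</th>
    </tr>
    <tr class="bottom">
      <th>State</th><th>Date</th><th>Delegates</th>
      <th>won/target</th><th>won/target</th><th>won/target</th>
    </tr>
    <tr>
      <td>Iowa</td><td>Feb. 1</td><td>30</td>
      <td>7/10</td><td>8/9</td><td>1/2</td>
    </tr>
  </table>
</div>
"""

doc = html.fromstring(SAMPLE)
table = doc.findall('.//table[@class="delegates desktop"]')[0]
rows = table.findall('.//tr')

# Row 0 is the tr class="top" row; drop its first, empty th cell.
candidates = [th.text_content() for th in rows[0].findall('.//th')[1:]]
print(candidates)
```

If the position of the row is not guaranteed, selecting it by class with table.find('.//tr[@class="top"]') is an alternative to relying on index 0.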

Answer 1: (score: 0)

Not pretty (it is hard-coded), but it works the way you want.

def parse_options_data(table):
    rows = table.findall(".//tr")
    candidate = extractCells(rows[0], isHeader=True)[1:]
    header = extractCells(rows[1], isHeader=True)[:3] + candidate
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)
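
The hard-coded 3 could probably be avoided by deriving the number of leading columns from the two header rows. The sketch below assumes that the second header row has exactly one trailing cell per candidate, which is what the [:3] + candidate construction above implies. build_header is a hypothetical helper, and the sample labels are made up.

```python
def build_header(sub_header, candidates):
    """Replace the trailing per-candidate labels in sub_header with the names."""
    # Number of leading columns that are not per-candidate columns.
    n_leading = len(sub_header) - len(candidates)
    if n_leading < 0:
        raise ValueError("more candidates than header cells")
    return sub_header[:n_leading] + candidates


# Made-up sample values standing in for extractCells(rows[1], isHeader=True)
# and extractCells(rows[0], isHeader=True)[1:].
sub_header = ["State", "Date", "Delegates",
              "won/target", "won/target", "won/target"]
candidates = ["Trump", "Cruz", "Kasich"]

header = build_header(sub_header, candidates)
print(header)
```

Inside parse_options_data, the result would take the place of the header[:3] + candidate expression.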