XPath returns an empty list despite using /text()

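Time: 2016-11-19 11:11:32

Tags: python xpath

I am trying to scrape data with XPath from the racenet results page requested in the code below. I used the browser's Inspect tool to copy the path and appended /text() to the end, but the query returns an empty list instead of ["Class 5"], which is the text between the last span tags.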

import requests
from lxml import html

sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')

print(r1class)

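The element I am targeting is the class of Race 1 (Class 5), and the page structure shown in the inspector matches the XPath I am using.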

3 answers:

Answer 0 (score: 1)

An XPath query of this kind should do the job, i.e. it works correctly on other websites when the expression matches their markup. The racenet website does not serve valid HTML, which is most likely why your code fails. You can check this with the W3C online validator: https://validator.w3.org

Answer 1 (score: 1)

This should get you started.

import requests
from lxml.etree import HTML

sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
# Select the results tables by class rather than by position.
races = tree.xpath('//table[@class="tblLatestHorseResults"]')
for race in races:
    rows = race.xpath('.//tr')
    for row in rows:
        # string() joins all text inside a cell; non-breaking spaces are dropped.
        row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td')]
        print(row_text_as_list)
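
As a rough illustration of why the snippet above uses string() rather than text(), here is a minimal sketch on a single hypothetical table row. The markup is invented for the example and is not copied from the racenet page; only the lxml calls are taken from the answer.

```python
from lxml import html

# Hypothetical row, invented for illustration; not copied from the racenet page.
table = html.fromstring(
    '<table><tr><td><span class="bold">Class 5</span> Track:\xa0</td></tr></table>'
)
cell = table.xpath('.//td')[0]

# text() returns only the text nodes that are direct children of the cell,
# so the text inside the nested span is missing.
direct_text = cell.xpath('text()')

# string() concatenates the text of the cell and all of its descendants.
full_text = cell.xpath('string()').replace(u'\xa0', u'')

print(direct_text)
print(full_text)
```

Here text() skips the content of the nested span, while string() includes it, which is why string() is the more forgiving choice when a cell mixes plain text and inline tags.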

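Answer 2 (score: 1)

Your XPath expression matches nothing because the HTML page you are trying to scrape is badly broken. Firefox (or any other web browser) repairs the page before displaying it, which adds HTML tags that do not exist in the original document.

The following code contains an XPath expression that will most likely point you in the right direction.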

import requests
from lxml import html, etree

sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
# Path written against the raw markup, not the browser-repaired DOM.
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
for node in nodes:
    # encoding="unicode" returns str, so the output also reads cleanly on Python 3.
    print(etree.tostring(node, encoding="unicode"))

When executed, this prints the following:

$ python test.py
<span class="bold">Class 5</span> Track: 
<span class="bold">Class 4</span> Track: 
<span class="bold">Class 4</span> Track: 
<span class="bold">Class 4</span> Track: 
<span class="bold">Class 4</span> Track: 
<span class="bold">Class 3</span> Track: 
<span class="bold">Class 2</span> Track: 
<span class="bold">Class 3</span> Track: 
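
The mismatch between the browser's DOM and the raw markup can be reproduced offline. The sketch below assumes that the tag added by the browser is tbody, which browsers commonly insert into tables; this is an assumption and has not been confirmed for the racenet page. The markup is a hypothetical stand-in modelled on the XPath above.

```python
from lxml import html

# Hypothetical markup modelled on the XPath in this answer; the source
# deliberately contains no <tbody>, as raw HTML often does.
RAW_HTML = """
<div id="resultsListContainer">
  <div>
    <table class="tblLatestHorseResults">
      <tr class="raceDetails"><td><span class="bold">Class 5</span> Track: </td></tr>
    </table>
  </div>
</div>
"""

tree = html.fromstring(RAW_HTML)

# Path as a browser inspector would typically report it, with a tbody step.
browser_style = tree.xpath(
    '//*[@id="resultsListContainer"]/div/table/tbody/tr[1]/td/span[1]/text()'
)

# Same path without the tbody step, matching the markup as served.
raw_style = tree.xpath(
    '//*[@id="resultsListContainer"]/div/table/tr[1]/td/span[1]/text()'
)

print(browser_style)
print(raw_style)
```

The first query comes back empty because lxml does not add the missing tbody when it parses the table, so a path copied from the inspector can fail even though it looks correct in the browser.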

Tip: whenever you scrape a web page and things do not work as expected, download the HTML and save it to a file. In this case, for example:

# sample_page.content is bytes, so open the file in binary mode.
with open("test.xml", "wb") as f:
    f.write(sample_page.content)

Then take a look at the saved HTML. This gives you an idea of what the DOM that your parser sees actually looks like.
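
Besides reading the saved file by eye, you can let lxml summarise it. The following is a minimal sketch: page_bytes is a hypothetical stand-in for sample_page.content, and the temporary path is used only so the example does not overwrite anything.

```python
import os
import tempfile
from collections import Counter

from lxml import html

# Hypothetical stand-in for sample_page.content (bytes); not the real page.
page_bytes = b"""<div id="resultsListContainer"><table class="tblLatestHorseResults">
<tr class="raceDetails"><td><span class="bold">Class 5</span> Track: </td></tr>
</table></div>"""

# Save the download to disk, as suggested in the tip above.
path = os.path.join(tempfile.mkdtemp(), "test.xml")
with open(path, "wb") as f:
    f.write(page_bytes)

# Re-parse the saved copy; no network access is needed from here on.
with open(path, "rb") as f:
    tree = html.fromstring(f.read())

# Count element tags, skipping comments and processing instructions,
# whose .tag attribute is not a string.
tag_counts = Counter(el.tag for el in tree.iter() if isinstance(el.tag, str))

print(tag_counts)
print("tbody present:", "tbody" in tag_counts)
```

If a tag that appears in the copied XPath is absent from this count, that step of the path is a likely reason for the empty result.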