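I'm trying to scrape data from here using XPath. I used the browser's Inspect tool to copy the path and appended /text() to the end, but it returns an empty list instead of ["Class 5"], which is the text between the last span tags.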
import requests
from lxml import html
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
r1class = tree.xpath('//*[@id="resultsListContainer"]/div[3]/table/tbody/tr[1]/td/span[1]/text()')
print(r1class)
The element I'm targeting is the class of race 1 (Class 5), and the page structure matches the XPath I'm using.
Answer 0 (score: 1)
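The code below should do the job, i.e. it works when used with other websites and a matching XPath expression. The racenet website does not deliver valid HTML, which is most likely why your code fails. You can verify this with the W3C online validator: https://validator.w3.org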
import Alamofire
struct Communication {
static let foneKey = "fone"
static let emailKey = "email"
static let countryKey = "country"
enum Router: URLRequestConvertible {
case Init(String, String, String)
var URLRequest: NSMutableURLRequest {
let result: (path: String, method: Alamofire.Method, parameters: [String: AnyObject]) = {
switch self {
case .Init(let fone, let email, let country):
let params = [foneKey: fone, emailKey: email, countryKey: country]
return ("Init", .POST, params)
}
}()
let URL = NSURL(string: baseURLString)
let request = NSMutableURLRequest(URL: URL!.URLByAppendingPathComponent(result.path))
let encoding = ParameterEncoding.JSON
request.HTTPMethod = result.method.rawValue
request.URLRequest.HTTPMethod = result.method.rawValue
return encoding.encode(request, parameters: result.parameters).0
}
}
}
Answer 1 (score: 1)
This should get you started.
import requests
from lxml.etree import HTML
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16").content
tree = HTML(sample_page)
races = tree.xpath('//table[@class="tblLatestHorseResults"]')
for race in races:
    rows = race.xpath('.//tr')
    for row in rows:
        # Text of every cell in the row, with non-breaking spaces removed
        row_text_as_list = [i.xpath('string()').replace(u'\xa0', u'') for i in row.xpath('.//td') if i is not None]
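Answer 2 (score: 1)
Your XPath expression doesn't match anything because the HTML page you are trying to scrape is seriously broken. FF (or any other web browser) repairs the page before displaying it. As a result, HTML tags are added that do not exist in the original document. A common example is the tbody element that browsers add inside tables: it appears in an XPath copied from the inspector but may be absent from the HTML the server actually sends.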
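To illustrate the point about tags added by the browser, here is a minimal sketch that parses a small hand-written snippet with lxml. The snippet is a hypothetical, simplified stand-in for the page's raw HTML (the real markup is larger and may differ); it only assumes that the raw table has no tbody element.

```python
from lxml import html

# Hypothetical stand-in for the raw HTML sent by the server.
# Note that the table has no <tbody>; a browser would add one when rendering.
RAW_HTML = (
    '<div id="resultsListContainer"><div>'
    '<table class="tblLatestHorseResults">'
    '<tr class="raceDetails">'
    '<td><span class="bold">Class 5</span> Track: Good</td>'
    '</tr></table></div></div>'
)

tree = html.fromstring(RAW_HTML)

# Path as a browser inspector would report it (includes tbody).
browser_style = tree.xpath(
    '//*[@id="resultsListContainer"]/div/table/tbody/tr/td/span[1]/text()'
)

# Same path without the element that only exists in the browser's DOM.
raw_style = tree.xpath(
    '//*[@id="resultsListContainer"]/div/table/tr/td/span[1]/text()'
)

print(browser_style)  # []
print(raw_style)      # ['Class 5']
```

If the first query comes back empty and the second does not, the copied path contains elements that exist only in the browser's repaired DOM.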
The following code contains an XPath expression that will most likely point you in the right direction.
import requests
from lxml import html, etree
sample_page = requests.get("https://www.racenet.com.au/horse-racing-results/happy-valley/2016-11-16")
tree = html.fromstring(sample_page.content)
nodes = tree.xpath("//*[@id='resultsListContainer']/div/table[@class='tblLatestHorseResults']/tr[@class='raceDetails']/td/span[1]")
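for node in nodes:
    # encoding="unicode" makes tostring() return str, so this also prints cleanly on Python 3
    print(etree.tostring(node, encoding="unicode"))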
When executed, it prints the following:
$ python test.py
<span class="bold">Class 5</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 4</span> Track:
<span class="bold">Class 3</span> Track:
<span class="bold">Class 2</span> Track:
<span class="bold">Class 3</span> Track:
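Tip: whenever you try to scrape a web page and things don't work as expected, download the HTML and save it to a file. In this case, for example:
# sample_page.content is bytes, so open the file in binary mode
with open("test.xml", "wb") as f:
    f.write(sample_page.content)
Then take a look at the saved HTML. This gives you an idea of what the DOM actually looks like.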
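One way to inspect the saved HTML is to let lxml report the path it sees for the element you want, instead of relying on the path shown by the browser. This is a minimal sketch, assuming the saved file contains a table like the hypothetical snippet below (inlined as a string here so the example runs without network access); the variable names are illustrative only.

```python
from lxml import html

# Hypothetical stand-in for the contents of the saved file.
SAVED_HTML = (
    '<div id="resultsListContainer"><div>'
    '<table class="tblLatestHorseResults">'
    '<tr class="raceDetails">'
    '<td><span class="bold">Class 5</span> Track: Good</td>'
    '</tr></table></div></div>'
)

tree = html.fromstring(SAVED_HTML)
root_tree = tree.getroottree()

# Ask lxml for the structural path of every span in the parsed document.
paths = [root_tree.getpath(span) for span in tree.iter("span")]

for path in paths:
    print(path)
```

Comparing these paths with the one copied from the browser shows which steps (for example tbody) are missing from the document as lxml parses it.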