Question

I'm trying to scrape average GPA data, and more, from many pages similar to this one:

http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers

My problem is that gpa_headers comes back empty, even though at least one h3 element contains "GPA".

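What could be causing this? I thought it might be because the page has dynamic elements that Mechanize has trouble with, but I can run puts page.body and the output includes: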

... <h3 style="text-align:center;">GPA REQUIREMENT</h3> ...

As I understand it, the xpath I'm using should find that element.

If there is a better way to do this, I'd like to know that too.

Answer 1

This looks like a problem with the site's DOM structure: it contains a tag named style that is never closed, as shown here:

<td colspan='7'><style='text-align:center;font-style:italic'>The
institution has been granted Candidate for Accreditation status by the
Commission on Accreditation in Physical Therapy Education (1111 North
Fairfax Street, Alexandria, VA, 22314; phone: 703.706.3245; email: <a
href='mailto:accreditation@apta.org'>accreditation@apta.org</a>).
Candidacy is not an accreditation status nor does it assure eventual
accreditation. Candidate for Accreditation is a pre-accreditation
status of affiliation with the Commission on Accreditation in Physical
Therapy Education that indicates the program is progressing toward
accreditation.</td>

As you can see, the td tag is closed, but the inner style tag never is.

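If you don't need that part of the markup, I suggest removing it before you try to work with the whole response. I have no experience with Ruby, but I would do something like this:

Get the raw content of the response.
Replace the part that matches the regular expression '(<style=\'.*)</td>' with an empty string, or close the tag yourself.
Use this new response body.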

Now you can use the xpath selector.
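The steps above could be sketched in plain Ruby roughly as follows. This is only an illustration: SAMPLE_HTML is a shortened, made-up stand-in for the real page body, and the helper names strip_broken_style and close_broken_style are hypothetical. It shows both options mentioned, dropping the fragment or closing the tag yourself.

```ruby
# Hypothetical sample standing in for the real page body (not fetched from the site).
SAMPLE_HTML = <<~HTML
  <table><tr>
  <td colspan='7'><style='text-align:center;font-style:italic'>The
  institution has been granted Candidate for Accreditation status.</td>
  </tr></table>
  <h3 style="text-align:center;">GPA REQUIREMENT</h3>
HTML

# Matches from the malformed <style=...> opener up to the next </td>.
# The /m flag lets . match newlines, since the fragment may span lines.
BROKEN_STYLE = /<style=.+?<\/td>/m

# Option 1: drop the fragment entirely, keeping the closing </td>.
def strip_broken_style(html)
  html.sub(BROKEN_STYLE, '</td>')
end

# Option 2: keep the text but close the tag ourselves, turning the
# malformed opener into a well-formed <span style=...>...</span>.
def close_broken_style(html)
  html.sub(/<style=('[^']*')>(.+?)<\/td>/m, '<span style=\1>\2</span></td>')
end

stripped = strip_broken_style(SAMPLE_HTML)
closed = close_broken_style(SAMPLE_HTML)
puts stripped
puts closed
```

Either cleaned string can then be handed to the HTML parser in place of the original body. Both patterns are lazy, so they stop at the first </td> after the broken tag; that assumes the fragment contains no nested table cell.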

Answer 2

eLRuLL identified the root cause above. Here is an example of how I fixed it; it returns the headers I was looking for:

require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
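# Raw HTML as served, including the unclosed <style=...> pseudo-tag
mangled_text = page.body
# Drop the broken fragment up to its closing </td>; /m lets . span newlines
fixed_text = mangled_text.sub(/<style=.+?<\/td>/m, "</td>")
# Re-parse the repaired markup with Nokogiri
page = Nokogiri::HTML(fixed_text)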
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers

Answer 3

A more robust solution is to use an HTML5 parser such as nokogumbo:

require 'nokogumbo'
doc = Nokogiri::HTML5(page.body)
gpa_headers = doc.search('//h3[contains(text(), "GPA")]')

Searching by text with Mechanize / Nokogiri
