Parsing an HTML page with mechanize to get the required array

Time: 2012-03-10 05:30:02

Tags: html ruby parsing nokogiri mechanize

The page I get back from mechanize (agent.get) contains the following HTML:

<div class="b-resumehistorylist-views">

<!-- first date start-->

<div class="b-resumehistory-date">date1</div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time1</div>
<a href="company_lynk1">company1</a></div>


<!-- second date start -->

<div class="b-resumehistory-date">date2</div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time2</div>
<a href="company_lynk2">company2</a>
</div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time3</div>
<a href="company_lynk3">company3</a></div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time4</div>
<a href="company_lynk4">company4</a></div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time5</div>
<a href="company_lynk5">company5</a></div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time6</div>
<a href="company_lynk6">company6</a></div>

<div class="b-resumehistory-company">
<div class="b-resumehistory-time">time7</div>
<a href="company_lynk7">company7</a></div>

...

</div>

For each date, I need to search inside the div with class="b-resumehistorylist-views", find all the divs that sit between two date divs, and link each of those items to that particular date.

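The problem is that the items (div class="b-resumehistory-company") are not nested inside their date div (div class="b-resumehistory-date"); the dates and the items are all siblings inside div class="b-resumehistorylist-views".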

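In the end I need to get the following array: array = [ [date1, time1, company1, companylink1], [date2, time2, company2, companylink2], [date2, time3, company3, companylink3], [date2, time4, company4, companylink4] ]

I know I have to use the search method with the text() option, but I can't find a solution. My code can currently parse all the company information inside each div class="b-resumehistory-company", but I still need to find the correct date for each one.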

1 Answer:

Answer 0: (score: 1)

This is the same as before, with just a few class attributes changed:

doc = agent.get(someurl).parser
doc.css('.b-resumehistory-company').map do |x|
  [x.at('./preceding-sibling::div[@class="b-resumehistory-date"][1]').text,
   x.at('.b-resumehistory-time').text, x.at('a').text, x.at('a')[:href]]
end