Nokogiri: irregular div content

Time: 2014-05-19 17:31:05

Tags: ruby xpath nokogiri

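I'm trying to process irregular content inside div elements, namely whatever follows each h3 heading. There is no fixed content under an h3 heading, but I need to associate any text with its heading. There may be a ul, just a span, or both. The main thing is not to lump all the text under the h3 headings together.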

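I've been able to navigate to my divs using the .css method. If there are multiple comments, each div contains one or more of the 4 possible h3 headings, each followed by comments or a list.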

How can I separate out whatever follows an h3 tag (if anything) and ends before the next h3 tag?

Here is a sample of the divs I'm working with (I can already grab everything inside the h2, since it has the same structure in every div):

<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Mar</span><strong>4</strong>
    </div>Routine Inspection<small>Inspected Mar. 4, 2014</small>
  </h2>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>

<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Sep</span><strong>4</strong>
    </div>Re-inspection<small>Inspected Sep. 4, 2013</small>
  </h2>
  <h3>Not in compliance</h3>
  <ul>
    <li class="X">
      <strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
    </li>
  </ul>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>

<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Aug</span><strong>30</strong>
    </div>Routine Inspection<small>Inspected Aug. 30, 2013</small>
  </h2>
  <h3>Not in compliance</h3>
  <ul>
    <li class="X">
      <strong>Washrooms are cleaned regularly</strong><p>Washrooms are to be kept clean, sanitary, in good repair and must be supplied with liquid soap in a dispenser, single service/paper towels, cloth roller towel or hot air dryer and hot and cold running water.</p>
    </li>
    <li class="X">
      <strong>Building interior is well-maintained</strong><p>Walls, floors and ceilings are to be maintained and in good repair.</p>
    </li>
    <li class="X">
      <strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
    </li>
  </ul>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>

1 answer:

Answer 0: (score: 0)

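Provided that:

  • you only have alternating h3 and ul elements up to the end of the wrapping div
  • no other element can appear in this structure in place of a ul
  • no other element can appear in this structure in place of an h3

and your sample is representative, this should do the trick.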

//ul[count(following-sibling::h3) = count(following-sibling::ul)]

If other elements can appear in place of the ul, but there is only ever one element between h3s, you can use this expression:

//ul[count(following-sibling::h3) = count(following-sibling::*[not(local-name() = 'h3')])]

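As for grouping each h3 with its ul directly, I don't think that is feasible in XPath alone. You will need to do that in Ruby. I suggest selecting the div elements and walking through their children yourself, counting nodes as you go and pairing the odd and even ones (h3 and ul) together.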