XPath: select text from a link and from plain text

Time: 2013-04-19 07:54:52

Tags: ruby xpath

I have the following code, which works well:

rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]/text()'],   
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

However, the "Food" td contains not only plain text but, in some rows, a link whose text I also want to capture.

I know I can use 'td[1]/a/text()' to get the link text, but how do I express "one or the other"?

'td[1]/a/text()' or 'td[1]/text()'

Edited - code snippet added

I am trying to include the <tr class="meal_header"> <td class="first alt">Breakfast</td> from the first row and the regular tds from the other rows, while excluding td[1] of the bottom row.

<tr class="meal_header">
  <td class="first alt">Breakfast</td>
  <td class="alt">Calories</td>
  <td class="alt">Carbs</td>
  <td class="alt">Fat</td>
  <td class="alt">Protein</td>
  <td class="alt">Sodium</td>
  <td class="alt">Sugar</td>
</tr>
<tr>  
<td class="first alt">            
  <a onclick="showEditFood(3992385560);" href="#">Hovis (Uk - White Bread (40g) Toasted With Flora Light Marg, 2 slice</a> </td>
  <td>262</td>   
  <td>36</td>
  <td>9</td>
  <td>7</td>
  <td>0</td>
  <td>3</td>
</tr>
<tr class="bottom">
  <td class="first alt" style="z-index: 10">
    <a href="/food/add_to_diary?meal=0" class="add_food">Add Food</a>
    <div class="quick_tools">
    <a href="#quick_tools_0" class="toggle_diary_options">Quick Tools</a>
    <div id="quick_tools_0" class="quick_tools_options hidden">
    <ul>
      <li><a onclick="showLightbox(200, 250, '/food/quick_add?meal=0&amp;date=2013-04-15'); return false;">Quick add calories</a></li>
     <li><a href="/meal/new?meal=0">Remember meal</a></li>
     <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-14&amp;meal=0&amp;username=nickwild1">Copy yesterday</a></li>  
     <li><a href="#recent_meals_0" class="toggle_diary_options">Copy from date</a></li>             
     <li><a href="#recent_meals_copy_to_0" class="toggle_diary_options">Copy to date</a></li>
    </ul>
    </div>
   <div id="recent_meals_0" class="recent_meal_options hidden">
    <ul id="recent_meal_options_0">
    <li class="header">Copy from which date?</li>        
    <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-14&amp;meal=0&amp;username=nickwild1">Sunday, April 14</a></li>
    <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-13&amp;meal=0&amp;username=nickwild1">Saturday, April 13</a></li>
    </ul>
    </div>
    </div>
  </td>
  <td>285</td>
  <td>39</td>
  <td>9</td>
  <td>10</td>
  <td>0</td>
  <td>3</td>
  <td></td>

2 answers:

Answer 0: (score: 2)

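The short answer is: use Nokogiri::XML::Element#text (inherited from Nokogiri::XML::Node, where it is an alias of #content). It returns the text of the element together with the text of its child elements (for example, your a).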

You can also clean up the code a bit:

keys = ["Food", "Calories", "Carbs", "Fat", "Protein", "Cholest"]
food_diary = rows.collect do |row|
  Hash[keys.zip row.search('td').map(&:text)]
end

As a final tip, avoid XPath when working with HTML; CSS selectors are usually nicer to read.

Answer 1: (score: 1)

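I think you can achieve this by changing the logic so that, when the xpath has no explicit text() extraction, you read the element's content instead: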

rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]'],   
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    if xpath.include?('/text()')
      detail[name] = row.at_xpath(xpath).to_s.strip
    else
      detail[name] = row.at_xpath(xpath).content.strip
    end
  end
  detail
end

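You could also add, for example, a symbol to each array entry describing how the data should be extracted, and use a case block to process each item in a final stage after the xpath has been evaluated.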

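Note that you could also do what you want by recursively walking the node structure returned by the xpath, but that seems like overkill if you just want to ignore markup, links, and so on.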