I have the following code, which works well:
rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]/text()'],
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
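However, the "Food" td sometimes contains plain text and sometimes a link whose text I want to capture.
I know I can use 'td[1]/a/text()' to get the link text, but how do I express "either one", that is:
'td[1]/a/text()' or 'td[1]/text()'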
Edited: added a code snippet
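I'm trying to include the first row, which contains <tr class="meal_header"><td class="first alt">Breakfast</td>, and the other rows, which contain the regular tds, while excluding td[1] of the bottom row.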
<tr class="meal_header">
<td class="first alt">Breakfast</td>
<td class="alt">Calories</td>
<td class="alt">Carbs</td>
<td class="alt">Fat</td>
<td class="alt">Protein</td>
<td class="alt">Sodium</td>
<td class="alt">Sugar</td>
</tr>
<tr>
<td class="first alt">
<a onclick="showEditFood(3992385560);" href="#">Hovis (Uk - White Bread (40g) Toasted With Flora Light Marg, 2 slice</a> </td>
<td>262</td>
<td>36</td>
<td>9</td>
<td>7</td>
<td>0</td>
<td>3</td>
</tr>
<tr class="bottom">
<td class="first alt" style="z-index: 10">
<a href="/food/add_to_diary?meal=0" class="add_food">Add Food</a>
<div class="quick_tools">
<a href="#quick_tools_0" class="toggle_diary_options">Quick Tools</a>
<div id="quick_tools_0" class="quick_tools_options hidden">
<ul>
<li><a onclick="showLightbox(200, 250, '/food/quick_add?meal=0&date=2013-04-15'); return false;">Quick add calories</a></li>
<li><a href="/meal/new?meal=0">Remember meal</a></li>
<li><a href="/food/copy_meal?date=2013-04-15&from_date=2013-04-14&meal=0&username=nickwild1">Copy yesterday</a></li>
<li><a href="#recent_meals_0" class="toggle_diary_options">Copy from date</a></li>
<li><a href="#recent_meals_copy_to_0" class="toggle_diary_options">Copy to date</a></li>
</ul>
</div>
<div id="recent_meals_0" class="recent_meal_options hidden">
<ul id="recent_meal_options_0">
<li class="header">Copy from which date?</li>
<li><a href="/food/copy_meal?date=2013-04-15&from_date=2013-04-14&meal=0&username=nickwild1">Sunday, April 14</a></li>
<li><a href="/food/copy_meal?date=2013-04-15&from_date=2013-04-13&meal=0&username=nickwild1">Saturday, April 13</a></li>
</ul>
</div>
</div>
</td>
<td>285</td>
<td>39</td>
<td>9</td>
<td>10</td>
<td>0</td>
<td>3</td>
<td></td>
Answer 0 (score: 2)
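The short answer: use Nokogiri::XML::Element#text, which returns the text of the element and of its child elements (for example, your a).
You can also clean the code up a bit: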
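keys = ["Food", "Calories", "Carbs", "Fat", "Protein", "Cholest"]
food_diary = rows.collect do |row|
  # text includes the text of nested elements such as <a>; strip trims the
  # surrounding whitespace, as the original code did
  Hash[keys.zip(row.search('td').map { |td| td.text.strip })]
end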
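One thing worth knowing about the zip-based version: Array#zip pairs elements by position and follows the length of the receiver (keys here), so extra cells are dropped and missing cells become nil. A minimal sketch, where the arrays are made-up stand-ins for the cell texts that Nokogiri would return:

```ruby
keys = ["Food", "Calories", "Carbs"]

# Hypothetical cell texts; in the real code they come from row.search('td').
full_row  = ["Toast", "262", "36", "9"]  # one more cell than keys
short_row = ["Toast", "262"]             # one fewer cell than keys

# zip pairs by position and stops at the end of `keys`.
full  = Hash[keys.zip(full_row)]   # the extra "9" is dropped
short = Hash[keys.zip(short_row)]  # "Carbs" maps to nil

p full
p short
```

With the table from the question, which has more columns than there are keys, this means the trailing columns are silently ignored unless more keys are added.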
As a final tip, avoid XPath for HTML where you can; CSS selectors are usually nicer to work with.
Answer 1 (score: 1)
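I think you can achieve this by changing the logic so that, when the xpath has no explicit text() step, you read the content of the element instead: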
rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]'],
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    if xpath.include?('/text()')
      detail[name] = row.at_xpath(xpath).to_s.strip
    else
      detail[name] = row.at_xpath(xpath).content.strip
    end
  end
  detail
end
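You could also add, for example, a symbol to each array entry describing how the data should be extracted, and use a case block to handle each item after selecting it via the xpath.
Note that you could also get what you want by recursively walking the node structure returned by the xpath, but that seems like overkill if you just want to ignore markup, links, and so on.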