Question

Take a look at this example:

<li><a href="http://website.com/">This is a website</a>, it belongs to John Sullivan</li>

I can get the content of the <li> tag by using:

nodeset = doc.css('li')

I can also get the text inside the <a> tag by using:

nodeset.each do |element|
  ahref = element.css('a')   # <a href="http://website.com/">This is a website</a>
  name = ahref.text.strip    # "This is a website"
end

But how do I get the rest of the text within the <li> tag, without the text from the <a> tag?

From this example, I would like to get

", it belongs to John Sullivan"

How can I do this?

Answer 1

This is easy with XPath and the text() node test. If you have already extracted the li elements into nodeset, you can get the text with:

nodeset.xpath('./text()')

Or you can get it directly from the whole document:

doc.xpath('//li/text()')

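This uses the text() node test as part of the XPath expression, rather than the Ruby text method. It selects any text nodes that are direct children of the li node, so it does not include the contents of the a element.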

Answer 2

I found a cheap way to get the rest of the text:

  ahref = element.css('a')
  name = ahref.text.strip
  suppl = element.text.strip.gsub(name, '')

Nokogiri: get the text that is not inside the <a> tag
