Finding consecutive siblings with XPath

Asked: 2012-09-14 21:50:29

Tags: xml xpath nokogiri

This should be an easy point for an XPath expert! :)

The document structure:

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

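Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], i.e. whenever two consecutive tokens both have an entityType of PROPER_NOUN, I want to extract the words from those two tokens.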

I've gotten as far as:

"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"

...which finds the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it.

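Some notes:

  • I don't mind doing higher-level processing of the NodeSets (e.g. in Ruby / Nokogiri) if that simplifies the problem.
  • If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].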
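
To make the second note concrete, here is a minimal Ruby sketch of the sliding-window behavior. It assumes the tokens have already been pulled out of the XML into plain [word, entity_type] pairs in document order; the helper name consecutive_pairs is hypothetical and is not part of Nokogiri.

```ruby
# Hypothetical helper: tokens is an array of [word, entity_type] pairs
# in document order. Returns every pair of adjacent words whose tokens
# both have the given entity type.
def consecutive_pairs(tokens, type = "PROPER_NOUN")
  tokens.each_cons(2)                                      # sliding window of size 2
        .select { |a, b| a[1] == type && b[1] == type }    # both tokens match
        .map { |a, b| [a[0], b[0]] }                       # keep only the words
end

tokens = [
  ["Newt", "PROPER_NOUN"], ["Gingrich", "PROPER_NOUN"], ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"], ["Trudeau", "PROPER_NOUN"]
]
p consecutive_pairs(tokens)
# prints [["Newt", "Gingrich"], ["Garry", "Trudeau"]]
```

Because the window advances one token at a time, a run A, B, C of the same type yields [A, B] and [B, C].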

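Update

Here's my solution using higher-level Ruby functions. But I'm tired of those XPath bullies kicking sand in my face, and I want to know how REAL XPath programmers do it!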

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence|
    tokens = sentence.xpath("token")
    prev = nil  # word of the previous token if it was a PROPER_NOUN, else nil
    tokens.each do |token|
      # name stays nil unless this token is a PROPER_NOUN
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end

3 Answers:

Answer 0: (score: 1)

I'd do it in two steps. The first step is to select a set of nodes:

//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

This gives you every token that starts a two-word pair. Then, to get the actual pairs, iterate over the node list and extract ./word and following-sibling::token[1]/word.

Using XmlStarlet (http://xmlstar.sourceforge.net/ - a great tool for quick XML manipulation), the command line is

xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml 

Newt,Gingrich
Garry,Trudeau

XmlStarlet will also compile that command line to XSLT; the relevant bit is

  <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
    <xsl:value-of select="word"/>
    <xsl:value-of select="','"/>
    <xsl:value-of select="following-sibling::token[1]/word"/>
    <xsl:value-of select="'&#10;'"/>
  </xsl:for-each>

With Nokogiri it could look something like:

# Parse the document
doc = Nokogiri::XML(the_document_string)

# Select all tokens that start a two-word pair
pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'

# Extract each word and the one following it
result = pair_starts.each_with_object([]) do |node, array|
  array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end

Answer 1: (score: 0)

XPath returns a node or a node set, but not groups. So you have to identify the start of each group and then grab the rest of it.

first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
# `next` is a reserved word in Ruby, so the second expression needs another name
following = "../following-sibling::token[1]/word"

doc.xpath(first).map{|word| [word.text, word.xpath(following).text] }

Output:

[["Newt", "Gingrich"], ["Garry", "Trudeau"]]

Answer 2: (score: 0)

XPath alone isn't powerful enough for this task. But in XSLT (2.0, which provides xsl:for-each-group) it's easy:

<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
     <xsl:copy-of select="current-group()"/>
     <xsl:text>====</xsl:text>
  </xsl:if>
</xsl:for-each-group>
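
For readers staying in Ruby, the same adjacent-grouping idea can be approximated with Enumerable#chunk. This is only a rough sketch, assuming the tokens are already available as plain [word, entity_type] pairs; the helper name proper_noun_runs is hypothetical.

```ruby
# Hypothetical helper: group adjacent tokens by entity type (roughly what
# group-adjacent does) and keep the words of each PROPER_NOUN run.
def proper_noun_runs(tokens)
  tokens.chunk { |_word, type| type }                   # [[type, [tokens...]], ...]
        .select { |type, _group| type == "PROPER_NOUN" }
        .map { |_type, group| group.map(&:first) }      # words only
end

tokens = [
  ["Newt", "PROPER_NOUN"], ["Gingrich", "PROPER_NOUN"], ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"], ["Trudeau", "PROPER_NOUN"]
]
p proper_noun_runs(tokens)
# prints [["Newt", "Gingrich"], ["Garry", "Trudeau"]]
```

Note that grouping emits whole runs rather than pairs: three adjacent PROPER_NOUN tokens A, B, C come out as one group [A, B, C], and a lone PROPER_NOUN comes out as a one-word group, so a further pass (for example each_cons(2) over each run) would be needed to get [A, B], [B, C].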