Question

I'm loading regulations into a database and placing them in a tree hierarchy. When the XML I'm scraping is structured like this, scraping it is trivial:

<CHAPTER>
  <PART>
    <SUBPART>
      <SECTION>
        <HD>Section Title</HD>
      </SECTION>
      <SECTION> ... </SECTION>
      <APPENDIX>
        <HD>Appendix Title</HD>
        <P>Appendix content...</P>
        <FOO>More content in unexpected tags</FOO>
      </APPENDIX>
    </SUBPART>
    <SUBPART> ... </SUBPART>
  </PART>
</CHAPTER>

Since I have to know what the parent ID is, I do something along these lines:

parent_id = 1

doc.xpath("//chapter/part/subpart").each do |subpart| 
  title = subpart.xpath("hd").first.text

  # add is a method that creates object and saves it to database, returning its id
  id = add(title,'SUBPART',parent_id)

  subpart.xpath('section').each do |section|
    title = section.xpath('hd').first.text
    add(title,'SECTION',id)
  end

  subpart.xpath('appendix').each do |app|
    # read the heading from the appendix node itself (not from `section`)
    title = app.xpath('hd').first.text
    content = app.to_s
    add(title,'APPENDIX',id,content) # content is an optional argument
  end
end

However, the XML isn't always laid out this logically. Sometimes the appendices aren't wrapped in a tag of their own :(

In that case, the XML looks like this:

<EXTRACT>
  <HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
  <NOTE>
    <HD SOURCE="HED">Note:</HD>
    <P>The requirements applicable to construction work under this   appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
  </NOTE>
  <HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
  <NOTE>
    <HD SOURCE="HED">Note:</HD>
    <P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
  </NOTE>
  <HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
  <NOTE>
    <HD SOURCE="HED">Note:</HD>
    <P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
  </NOTE>
  <HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
  <NOTE>
    <HD SOURCE="HED">Note:</HD>
    <P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
  </NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>

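The only way I can think of to extract these appendices is to iterate over the children of the <EXTRACT> node and check each one to see whether its name is "HD" and it has "Appendix" in its text. Then I'd save everything that follows until I hit the next <HD> with "Appendix" in its text.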

That feels like a very clunky solution. Is there a better way?

Nokogiri: extracting objects that aren't wrapped in a tag
