How can I use Nokogiri to select a table if the text in the preceding p elements contains a specific term?

Asked: 2016-07-03 19:06:00

Tags: html ruby xpath nokogiri

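I have a bunch of financial reports from which I need to extract some specific tables. I have been using Ruby for this (I am very new to programming), and so far I have had good results with a keyword-matching system that looks for matches inside the tables. However, I would like to try a second approach: run the keyword filter against the text of the p elements above a table and, if a match is found, write that table to a file. I have been searching for a way to do this, mostly with XPath (this link was really helpful: https://www.simple-talk.com/dotnet/.net-framework/xpath,-css,-dom-and-selenium-the-rosetta-stone/), but unfortunately I have not got very far. An example of the documents I need to process: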

https://www.sec.gov/Archives/edgar/data/1583671/000106299315005260/form10k.htm#page_15

The p elements need to be matched with the balance sheet table that follows them, for example:

<p align="center"><b>SCIENCE TO CONSUMERS, INC. </b></p>
<p align="center"><b>BALANCE SHEET </b><br>
</p>
<table style="BORDER-COLOR: black; FONT-SIZE: 10pt; BORDER-COLLAPSE: collapse; " width="100%" border="0" cellpadding="0" cellspacing="0">

  <tbody><tr valign="top">
    <td align="left">&nbsp; </td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="center" width="12%" nowrap=""><b>May 31,</b> </td>
    <td align="center" width="2%" nowrap="">&nbsp;</td>
    <td align="center" width="1%" nowrap="">&nbsp;</td>
    <td align="center" width="12%" nowrap=""><b>May 31, 2014</b> </td>
  <td align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="center"><b>ASSETS</b> </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="12%" nowrap=""><b>2015</b> </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="2%" nowrap="">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="1%" nowrap="">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="12%" nowrap="">&nbsp; </td>
  <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Current Assets </td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left" bgcolor="#E6EFFF">&nbsp; &nbsp; &nbsp;Cash and cash equivalents </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#E6EFFF">$</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#E6EFFF">1,749 </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#E6EFFF">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#E6EFFF">$</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#E6EFFF">&nbsp;5,171 </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#E6EFFF">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Total Current Assets </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">1,749    </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">5,171    </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td></tr>
  <tr>
    <td bgcolor="#e6efff">&nbsp; </td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr>
    <td>&nbsp; </td>
    <td width="1%">&nbsp;</td>
    <td width="12%">&nbsp; </td>
    <td width="2%">&nbsp;</td>
    <td width="1%">&nbsp;</td>
    <td width="12%">&nbsp; </td>
    <td width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left" bgcolor="#e6efff">Total Assets </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%" bgcolor="#e6efff">$</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%" bgcolor="#e6efff">&nbsp;1,749 </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%" bgcolor="#e6efff">$</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%" bgcolor="#e6efff">&nbsp;5,171 </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr>
    <td>&nbsp; </td>
    <td width="1%">&nbsp;</td>
    <td width="12%">&nbsp; </td>
    <td width="2%">&nbsp;</td>
    <td width="1%">&nbsp;</td>
    <td width="12%">&nbsp; </td>
    <td width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="center" bgcolor="#e6efff"><b>LIABILITIES AND STOCKHOLDERS’
      EQUITY</b> </td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Liabilities </td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left" bgcolor="#e6efff">Current Liabilities </td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">&nbsp; &nbsp; &nbsp;Loan from director </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,891</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,217    </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td></tr>
  <tr>
    <td bgcolor="#e6efff">&nbsp; </td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Total Liabilities </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,891    </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,217    </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">&nbsp;</td></tr>
  <tr>
    <td bgcolor="#e6efff">&nbsp; </td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Stockholders’ Equity </td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="left" width="12%">&nbsp; </td>
    <td align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left" bgcolor="#e6efff">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Common stock,
      par value $0.001; 525,000,000 shares
      authorized,&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;29,900,000 shares
      issued and outstanding; </td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="right" width="12%" bgcolor="#e6efff">29,900 </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td align="right" width="12%" bgcolor="#e6efff">29,750 </td>
    <td align="left" width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">&nbsp; &nbsp; &nbsp;Additional paid in capital </td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="right" width="12%">61,100</td>
    <td align="left" width="2%">&nbsp;</td>
    <td align="left" width="1%">&nbsp;</td>
    <td align="right" width="12%">16,250 </td>
    <td align="left" width="2%">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left" bgcolor="#e6efff">&nbsp; &nbsp; &nbsp;Deficit accumulated
      during the development stage </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#e6efff">(98,142</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#e6efff">) </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#e6efff">(49,046</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#e6efff">) </td></tr>
  <tr valign="top">
    <td align="left">Total Stockholders’ Equity </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">(7,142</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">) </td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">(3,046</td>
    <td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">) </td></tr>
  <tr>
    <td bgcolor="#e6efff">&nbsp; </td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td>
    <td width="1%" bgcolor="#e6efff">&nbsp;</td>
    <td width="12%" bgcolor="#e6efff">&nbsp; </td>
    <td width="2%" bgcolor="#e6efff">&nbsp;</td></tr>
  <tr valign="top">
    <td align="left">Total Liabilities and Stockholders’ Equity </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%">$</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%">&nbsp;1,749 </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%">&nbsp;</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%">$</td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%">&nbsp;5,171 </td>
    <td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%">&nbsp;</td></tr></tbody></table>

So it should, for example, detect the text "BALANCE SHEET" and then write the table to a file.

This is what I have come up with so far:

output = File.open("output.htm", 'w')
htm = File.open( "a.htm", "r+" )
htm = Nokogiri::HTML(open(htm)) do |config|
   config.noblanks
end    
allelements = htm.xpath('//table | //p')
allelements.each_with_index do |element, index|
   if element.xpath('//table//*[contains(text(),\'Balance\')]') 
      output.puts element
   #if element.xpath('//p//*[contains(text(),\'Balance\')]') 
   #check next five elements and if one equals "table" then 
   #write that table to the output file.
   end
end

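Obviously this code is incomplete, but even this part does not work: the output file contains all p and table elements, which I do not understand (at this point I expected only table elements to end up in the output file).

Thanks for reading, any ideas or comments are welcome!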

1 Answer:

Answer 0 (score: 0)

I solved this after finding Nokogiri's ".name" method, which made it easy. This code works:

require 'nokogiri'

# Parse the report; noblanks drops whitespace-only text nodes.
report = File.open("a.htm") do |f|
  Nokogiri::HTML(f) { |config| config.noblanks }
end

# All <p> and <table> elements in document order, as a plain Array.
allelements = report.xpath('//table | //p').to_a
number_of_elements_to_assess = 5

File.open("output.htm", "w") do |output|
  allelements.each_with_index do |element, index|
    next unless element.name == "table"

    # Look at (up to) the five elements before this table. Clamp at 0 so a
    # negative index cannot wrap around to the end of the array.
    first = [index - number_of_elements_to_assess, 0].max
    preceding = allelements[first...index]

    # The match is case-insensitive, so the file does not need downcasing.
    # any? writes each table at most once, even if several p elements match.
    if preceding.any? { |e| e.name == "p" && e.text =~ /balance sheets?/i }
      output.puts element
    end
  end
end

Thanks to everyone who took the trouble to read this post.

Regards,

臼井