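I have a bunch of financial reports from which I need to extract some specific tables. I have been using Ruby for this (I'm very new to programming), and so far I have had good results with a keyword-matching system that matches terms inside the table. However, I'd like to use a second approach, in which the text of the p elements above the table is filtered for keywords and, if a match is found, the table is written to a file. I have been searching around (mostly on XPath; this link was really helpful: https://www.simple-talk.com/dotnet/.net-framework/xpath,-css,-dom-and-selenium-the-rosetta-stone/), but unfortunately I haven't gotten very far. An example of the documents I need to process is: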
https://www.sec.gov/Archives/edgar/data/1583671/000106299315005260/form10k.htm#page_15
where the p elements sit directly above the relevant balance sheet table, for example:
<p align="center"><b>SCIENCE TO CONSUMERS, INC. </b></p>
<p align="center"><b>BALANCE SHEET </b><br>
</p>
<table style="BORDER-COLOR: black; FONT-SIZE: 10pt; BORDER-COLLAPSE: collapse; " width="100%" border="0" cellpadding="0" cellspacing="0">
<tbody><tr valign="top">
<td align="left"> </td>
<td align="left" width="1%"> </td>
<td align="center" width="12%" nowrap=""><b>May 31,</b> </td>
<td align="center" width="2%" nowrap=""> </td>
<td align="center" width="1%" nowrap=""> </td>
<td align="center" width="12%" nowrap=""><b>May 31, 2014</b> </td>
<td align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="center"><b>ASSETS</b> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="12%" nowrap=""><b>2015</b> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="2%" nowrap=""> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="1%" nowrap=""> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="center" width="12%" nowrap=""> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="left">Current Assets </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="left" bgcolor="#E6EFFF"> Cash and cash equivalents </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#E6EFFF">$</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#E6EFFF">1,749 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#E6EFFF"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#E6EFFF">$</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#E6EFFF"> 5,171 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#E6EFFF"> </td></tr>
<tr valign="top">
<td align="left">Total Current Assets </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">1,749 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">5,171 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td></tr>
<tr>
<td bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td></tr>
<tr>
<td> </td>
<td width="1%"> </td>
<td width="12%"> </td>
<td width="2%"> </td>
<td width="1%"> </td>
<td width="12%"> </td>
<td width="2%"> </td></tr>
<tr valign="top">
<td align="left" bgcolor="#e6efff">Total Assets </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%" bgcolor="#e6efff">$</td>
<td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%" bgcolor="#e6efff"> 1,749 </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%" bgcolor="#e6efff"> </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%" bgcolor="#e6efff">$</td>
<td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%" bgcolor="#e6efff"> 5,171 </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%" bgcolor="#e6efff"> </td></tr>
<tr>
<td> </td>
<td width="1%"> </td>
<td width="12%"> </td>
<td width="2%"> </td>
<td width="1%"> </td>
<td width="12%"> </td>
<td width="2%"> </td></tr>
<tr valign="top">
<td align="center" bgcolor="#e6efff"><b>LIABILITIES AND STOCKHOLDERS’
EQUITY</b> </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="left" width="12%" bgcolor="#e6efff"> </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="left" width="12%" bgcolor="#e6efff"> </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left">Liabilities </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="left" bgcolor="#e6efff">Current Liabilities </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="left" width="12%" bgcolor="#e6efff"> </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="left" width="12%" bgcolor="#e6efff"> </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left"> Loan from director </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,891</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,217 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td></tr>
<tr>
<td bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left">Total Liabilities </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,891 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">8,217 </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%"> </td></tr>
<tr>
<td bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left">Stockholders’ Equity </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td>
<td align="left" width="1%"> </td>
<td align="left" width="12%"> </td>
<td align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="left" bgcolor="#e6efff"> Common stock,
par value $0.001; 525,000,000 shares
authorized, <br> 29,900,000 shares
issued and outstanding; </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="right" width="12%" bgcolor="#e6efff">29,900 </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td>
<td align="left" width="1%" bgcolor="#e6efff"> </td>
<td align="right" width="12%" bgcolor="#e6efff">29,750 </td>
<td align="left" width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left"> Additional paid in capital </td>
<td align="left" width="1%"> </td>
<td align="right" width="12%">61,100</td>
<td align="left" width="2%"> </td>
<td align="left" width="1%"> </td>
<td align="right" width="12%">16,250 </td>
<td align="left" width="2%"> </td></tr>
<tr valign="top">
<td align="left" bgcolor="#e6efff"> Deficit accumulated
during the development stage </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#e6efff"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#e6efff">(98,142</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#e6efff">) </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%" bgcolor="#e6efff"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%" bgcolor="#e6efff">(49,046</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%" bgcolor="#e6efff">) </td></tr>
<tr valign="top">
<td align="left">Total Stockholders’ Equity </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">(7,142</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">) </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="1%"> </td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="right" width="12%">(3,046</td>
<td style="BORDER-BOTTOM: #000000 1px solid" align="left" width="2%">) </td></tr>
<tr>
<td bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td>
<td width="1%" bgcolor="#e6efff"> </td>
<td width="12%" bgcolor="#e6efff"> </td>
<td width="2%" bgcolor="#e6efff"> </td></tr>
<tr valign="top">
<td align="left">Total Liabilities and Stockholders’ Equity </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%">$</td>
<td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%"> 1,749 </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%"> </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="1%">$</td>
<td style="BORDER-BOTTOM: #000000 3px double" align="right" width="12%"> 5,171 </td>
<td style="BORDER-BOTTOM: #000000 3px double" align="left" width="2%"> </td></tr></tbody></table>
So it should, for example, detect the text "BALANCE SHEET" and then write the table to a file.
This is what I have come up with so far:
output = File.open("output.htm", 'w')
htm = File.open( "a.htm", "r+" )
htm = Nokogiri::HTML(open(htm)) do |config|
  config.noblanks
end
allelements = htm.xpath('//table | //p')
allelements.each_with_index do |element, index|
  if element.xpath('//table//*[contains(text(),\'Balance\')]')
    output.puts element
    #if element.xpath('//p//*[contains(text(),\'Balance\')]')
    #check next five elements and if one equals "table" then
    #write that table to the output file.
  end
end
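Obviously this code is incomplete, but even this much doesn't work: the output file contains all the p and table elements, which I don't understand (I expected only table elements to be written to the output file at this point).
Thanks for reading; any ideas or comments are welcome!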
Answer 0 (score: 0)
I solved this after discovering Nokogiri's ".name" method, which made it straightforward. This code works:
require 'nokogiri'

# How many elements before each table are checked for the heading text.
number_of_elements_to_assess = 5

# Parse the report; noblanks drops whitespace-only text nodes.
financial_report = Nokogiri::HTML(File.read("a.htm")) do |config|
  config.noblanks
end

# All p and table elements, in document order.
allelements = financial_report.xpath('//table | //p').to_a

File.open("output.htm", "w") do |output|
  allelements.each_with_index do |element, index|
    # element.name is the tag name, so p elements are skipped here.
    next unless element.name == "table"

    # Elements just before the table, clamped so the index never goes negative.
    first = [index - number_of_elements_to_assess, 0].max
    preceding = allelements[first...index]

    # Case-insensitive, so the file need not be downcased first;
    # "sheets?" also matches the singular "BALANCE SHEET" heading.
    # any? writes each table at most once, even if several headings match.
    output.puts element if preceding.any? { |e| e.text =~ /balance sheets?/i }
  end
end
Thanks to everyone who took the trouble to read this post.
Regards,
臼井