Question

I am using Nokogiri to parse data from an .xml file. Here is an excerpt of the data file:

<table>
  <tr>
    <th class="indent normal">Profit and loss account</th>
    <td class="notefigure"></td>
    <td id="currentProfitAndLossAccount" class="figure">
      (<ix:nonFraction name="uk-gaap-pt:ProfitLossAccountReserve" contextRef="current-mud" unitRef="currencyUnit" format="ixt:numdotdecimal" decimals="0" sign="-">12,345</ix:nonFraction><span class="endnegmark">)</span>
    </td>
    <td id="previousProfitAndLossAccount" class="figure">
      (<ix:nonFraction name="uk-gaap-pt:ProfitLossAccountReserve" contextRef="previous-mud" unitRef="currencyUnit" format="ixt:numdotdecimal" decimals="0" sign="-">67,890</ix:nonFraction><span class="endnegmark">)</span>
    </td>
  </tr>
</table>

This is the code I am using:

require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

# this is how we request the page we're going to scrape
page = File.open("D:/accounts_file.xml") { |f| Nokogiri::XML(f) }

#this is the empty array to store the output
companies_array = []

# this is where the data is parsed

page.css('table').css('th').map do |a|
    post_name = a.text
    companies_array.push(post_name)
end

page.css('table').css('td').map do |a|
    post_name = a.text
    companies_array.push(post_name)
end

# this pushes the data into the .csv file
CSV.open('D:/financial_data','w') do |csv|
    csv << companies_array
end

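Currently I get the table header row followed by the table contents, but the contents are not aligned with the headers, and even if they were, it would be far from ideal.

Ideally I would like the id (e.g. "currentProfitAndLossAccount") followed by the corresponding value, in a list like this:

"currentProfitAndLossAccount" "12345" "previousProfitAndLossAccount" "67890"

With or without delimiters.

In reality I have about 20 fields to collate. Importing them into my database will then be easy. I have 100k files to import, but I have been struggling for over a week to get the first file into the right format for import.

This is my first question on Stack Overflow, although I use it every day. Please be gentle if I have not asked in the right way.

Thanks for your help.

With the help of Ronan Lopes, for which I am very grateful, I now have the following .rb file: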

require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

# this is how we request the page we're going to scrape
page = File.open("D:/Accounts.xml") { |f| Nokogiri::HTML(f) }

#this is an empty array where we will store the output
companies_array = []

# this is where we select the data we want to isolate

page.css('nonFraction').map{|n| { n.parent.attributes["id"].value => n.text } }

### This is the part that works, I think ###

post_name = n

# the next push command appends whatever is in the brackets to the companies_array storage
    companies_array.push(post_name)

# this will push the storage into a csv file
CSV.open('D:/accounts.csv','w') do |csv|
    csv << companies_array
end

I have spent 3 hours trying to solve this myself. Any help would be much appreciated, and would save me a sleepless night!

Answer 1

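I don't know whether this will work for all of your tables, but for that one it gives what you want (at the very least it should give you an idea for the others):

I parsed it as HTML rather than XML: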

page = Nokogiri::HTML(File.open("D:/Accounts.xml").read)

Then, to get the values you want from the HTML-parsed document:

page.css('nonfraction').map{|n| { n.parent.attributes["id"].value => n.text } }

This gives you an array of single-pair hashes, each mapping an id to its value. Hope this helps!
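
To get from that result to the id/value list the question asks for, something along these lines might work. It is only a sketch: the `pairs` array is hard-coded sample data standing in for the output of the `map` call above, and the names `pairs`, `fields` and `csv_text` are made up for illustration. It also ignores the `sign="-"` attribute, since the desired output in the question shows unsigned figures.

```ruby
require 'csv'

# Stand-in for what the map call returns on the question's excerpt:
# one single-pair hash per nonFraction element.
pairs = [
  { 'currentProfitAndLossAccount'  => '12,345' },
  { 'previousProfitAndLossAccount' => '67,890' }
]

# Merge the single-pair hashes into one hash, then strip the thousands
# separators so the figures can be imported as plain numbers.
fields = pairs.reduce({}, :merge).transform_values { |v| v.delete(',') }

# One CSV row per field: id in the first column, value in the second.
csv_text = CSV.generate do |csv|
  fields.each { |id, value| csv << [id, value] }
end

puts csv_text
```

To write a file instead of building a string, the same block body can go inside `CSV.open('D:/accounts.csv', 'w')`, as in the question. The key point is that the rows are built from the return value of `map`, rather than from the block variable `n`, which does not exist outside the block.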
