Question

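I have a bunch of pipe-delimited files in which carriage returns were not properly escaped when the files were generated, so I can't use CR or newline characters to delimit rows. I do know, however, that each record must have exactly 7 fields.

Splitting the fields is easy with the CSV library in Ruby 1.9 by setting the 'col_sep' parameter, but the 'row_sep' parameter can't be set because I have newlines within the fields.

Is there a way to parse a pipe-delimited file using a fixed number of fields as the row delimiter?

Thanks!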

Answer 1

Here's one way to do it:

Build a sample string of seven-word records, each with a newline embedded in the middle. It holds three rows' worth of data.

text = (["now is the\ntime for all good"] * 3).join(' ').gsub(' ', '|')
puts text
# >> now|is|the
# >> time|for|all|good|now|is|the
# >> time|for|all|good|now|is|the
# >> time|for|all|good

Process it like this:

lines = []
chunks = text.gsub("\n", '|').split('|')
while (chunks.any?)
  lines << chunks.slice!(0, 7).join(' ')
end

puts lines
# >> now is the time for all good
# >> now is the time for all good
# >> now is the time for all good

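So that shows we can rebuild the rows.

Assuming the words are actually columns from the pipe-delimited file, we can make the code do the real job by taking out the .join(' '):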

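require 'awesome_print' # provides ap

# chunks was emptied by the loop above, so rebuild it
lines = []
chunks = text.gsub("\n", '|').split('|')
while chunks.any?
  lines << chunks.slice!(0, 7)
end

ap lines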
# >> [
# >>     [0] [
# >>         [0] "now",
# >>         [1] "is",
# >>         [2] "the",
# >>         [3] "time",
# >>         [4] "for",
# >>         [5] "all",
# >>         [6] "good"
# >>     ],
# >>     [1] [
# >>         [0] "now",
# >>         [1] "is",
# >>         [2] "the",
# >>         [3] "time",
# >>         [4] "for",
# >>         [5] "all",
# >>         [6] "good"
# >>     ],
# >>     [2] [
# >>         [0] "now",
# >>         [1] "is",
# >>         [2] "the",
# >>         [3] "time",
# >>         [4] "for",
# >>         [5] "all",
# >>         [6] "good"
# >>     ]
# >> ]
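
For files too large to load in one go, the same idea can be sketched as a streaming helper. This is only an illustration: each_record is a hypothetical name, not part of any library, and like the loop above it assumes every newline can be treated as a field separator (so a field that really contains a newline comes back as two fields). It passes -1 to split so that trailing empty fields are still counted.

```ruby
require 'stringio'

# Hypothetical helper: yields one record (an array of fields) at a time
# from any IO-like object. Both "|" and newlines count as field separators.
def each_record(io, fields_per_record = 7)
  buffer = []
  io.each_line do |line|
    # -1 keeps trailing empty fields so the field count stays accurate
    buffer.concat(line.chomp.split('|', -1))
    yield buffer.shift(fields_per_record) while buffer.size >= fields_per_record
  end
  yield buffer unless buffer.empty? # leftover partial record, if any
end

# Same sample text as above, wrapped in a StringIO to stand in for a file.
text = (["now is the\ntime for all good"] * 3).join(' ').gsub(' ', '|')
records = []
each_record(StringIO.new(text)) { |record| records << record }
```

With a real file, File.open(path) { |f| each_record(f) { |record| ... } } would be used in place of the StringIO.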

Answer 2

Say you want to parse all the charities in the IRS's pipe-delimited txt file.

Suppose you have a model named Charity with the same fields as the pipe-delimited file.

class Charity < ActiveRecord::Base
   # http://apps.irs.gov/app/eos/forwardToPub78DownloadLayout.do
   # http://apps.irs.gov/app/eos/forwardToPub78Download.do
   attr_accessible :city, :country, :deductibility_status, :deductibility_status_description, :ein, :legal_name, :state
end

You can create a rake task in a file named import.rake:

namespace :import do

  desc "Import Pipe Delimited IRS 5013c Data"
  task :irs_data => :environment do

    require 'csv'

    txt_file_path = 'db/irs_5013cs.txt'
    # readlines ignores any block, so just collect the lines;
    # each line of the IRS file is treated as one record.
    results = File.readlines(txt_file_path)

    # Order Field Notes
    # 1  EIN   Required
    # 2  Legal Name  Optional
    # 3  City  Optional
    # 4  State   Optional
    # 5  Deductibility Status  Optional
    # 6  Country   Optional - If Country is null, then Country is assumed to be United States
    # 7  Deductibility Status Description  Optional

    results.each do |row|
      # Drop the trailing newline, then keep the first seven fields.
      row = row.chomp.split('|').first(7)
      Charity.create!({
        :ein                              => row[0],
        :legal_name                       => row[1],
        :city                             => row[2],
        :state                            => row[3],
        :deductibility_status             => row[4],
        :country                          => row[5],
        :deductibility_status_description => row[6]
      })
    end
  end
end

Finally, you can run this import by typing the following at the command line from your Rails application:

 rake import:irs_data
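
The field mapping in the task can also be tried without Rails. The following is a minimal sketch, assuming Ruby 2.1 or later (for Array#to_h); CHARITY_FIELDS and charity_attributes are hypothetical names, and the sample line is made-up data rather than a real IRS record. It applies the rule from the field notes above that an empty Country is taken to mean United States.

```ruby
# Column order taken from the field notes in the rake task.
CHARITY_FIELDS = %i[
  ein legal_name city state
  deductibility_status country deductibility_status_description
].freeze

# Hypothetical helper: turns one pipe-delimited line into an attributes hash.
def charity_attributes(line)
  # -1 keeps trailing empty fields so columns stay aligned
  values = line.chomp.split('|', -1).first(CHARITY_FIELDS.size)
  attrs = CHARITY_FIELDS.zip(values).to_h
  # Per the field notes, a missing country is assumed to be United States.
  attrs[:country] = 'United States' if attrs[:country].to_s.empty?
  attrs
end

# Invented sample line, for illustration only.
attrs = charity_attributes("000000000|Example Charity|Springfield|IL|PC||Public Charity\n")
```

The resulting hash could then be passed to Charity.create! in place of the hand-written hash in the task.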

Answer 3

Here's an idea, using a regular expression:

#!/opt/local/bin/ruby

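# Read the whole file at once; gets would return only the first line.
text = File.read("pipe_delim.txt")
r1 = /.*?\|.*?\|.*?\|.*?\|.*?\|.*?\|.*?\|/m
results = text.scan(r1)
results.each do |result|
  puts result
end

Reading line by line (for example with gets) trips on newlines inside a field, which is why the whole file is read in one go and the pattern uses the /m flag. The pattern also expects each of the seven fields to be followed by a pipe, so you may need to tweak it for your data.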
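
One possible tweak, offered only as a sketch: match six pipe-terminated fields (which may contain newlines) followed by a final field that runs to the end of its line. This assumes records are separated by "\n" and that the seventh field never contains a newline itself; RECORD and scan_records are hypothetical names, and the sample data is invented.

```ruby
# Six fields ending in "|" (newlines allowed inside), then a last field
# that stops at the next newline. The optional "\n" consumes the record
# separator so it does not leak into the next record's first field.
RECORD = /((?:[^|]*\|){6}[^|\n]*)\n?/

# Hypothetical helper: returns an array of 7-field records.
def scan_records(text)
  # scan returns one capture array per match; -1 keeps empty trailing fields
  text.scan(RECORD).map { |(record)| record.split('|', -1) }
end

# Invented sample: the second field of the first record spans two lines.
sample = "1|two\nlines|3|4|5|6|7\n8|9|10|11|12|13|14\n"
records = scan_records(sample)
```

Unlike replacing every newline with a pipe, this keeps an embedded newline inside its field, at the cost of the assumption about the last field.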

Answer 4

Just an idea, but the cucumber testing gem has a Cucumber::Ast::Table class that could be used to process this file.

Cucumber::Ast::Table.new(File.read(file))

Then I think the rows method is what you would use to read it back out.

Answer 5

Try using String#split and Enumerable#each_slice:

result = []
text.split('|').each_slice(7) { |record| result << record }
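
Splitting on '|' alone leaves each newline glued to a neighbouring field. A small variation, sketched here under the assumption that newlines can be treated as field separators too, splits on both characters and checks that the field count is a multiple of 7 before slicing. slice_records is a hypothetical name.

```ruby
# Hypothetical helper: split on "|" and "\n", then group into records.
def slice_records(text, fields_per_record = 7)
  # chomp drops one trailing newline; -1 keeps empty trailing fields
  fields = text.chomp.split(/[|\n]/, -1)
  unless (fields.size % fields_per_record).zero?
    raise ArgumentError,
          "#{fields.size} fields is not a multiple of #{fields_per_record}"
  end
  fields.each_slice(fields_per_record).to_a
end

# Sample text: three 7-field records with a newline after the third field.
text = (["now is the\ntime for all good"] * 3).join(' ').gsub(' ', '|')
result = slice_records(text)
```

The check is there so that a miscounted file fails loudly instead of producing records that are silently shifted by a field or two.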
