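How to parse a text file containing multiple lines of data, organized by value, and convert it to JSON

Time: 2016-07-10 23:58:18

Tags: ruby-on-rails ruby json text

I need to parse a text file in the following format and turn it into a Hash, which will then be converted to JSON.

The text file has the following format: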

HD040008000415350110XXXXXXXXXX0208XXXXXXXX0302EN0403USA0502EN0604000107014
EM04000800030010112TME001205IQ50232Blue Point Coastal Cuisine. INC.06145655th Avenue0805921010909SAN DIEGO1008Downtown1102CA1203USA

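Each line is a set of segments in a key, length, value format. For example, in the second line:

  • EM is the key
  • 04 is the length of the value, including spaces
  • 0008 is the value

Broken down, it looks like EM 04 0008. The segment keys that follow are numeric: they start at 00 and increment until the end of the line, then start over on the next line. I need to iterate over every line in the text file.

I need to be able to convert this into a Ruby hash, which in turn will be converted to JSON in an API response.

The current format is: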

EM0400080003001

It needs to be parsed into:

{"EM" => "0008", "00" => "001"}

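2 answers:

Answer 0: (score: 2)

This is a very common kind of encoding called Type-Length-Value (or Tag-Length-Value), for reasons I think are obvious. As with many tasks like this in Ruby, String#unpack is a good fit: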

def decode(data)
  return {} if data.empty?
  key, len, rest = data.unpack("a2 a2 a*")
  val = rest.slice!(0, len.to_i)
  { key => val }.merge(decode(rest))
end

p decode("HD040008000415350110XXXXXXXXXX0208XXXXXXXX0302EN0403USA0502EN0604000107014")
# => {"HD"=>"0008", "00"=>"1535", "01"=>"XXXXXXXXXX", "02"=>"XXXXXXXX", "03"=>"EN", "04"=>"USA", "05"=>"EN", "06"=>"0001", "07"=>"4"}

p decode("EM04000800030010112TME001205IQ50232Blue Point Coastal Cuisine. INC.0614565 5th Avenue0805921010909SAN DIEGO1008Downtown1102CA1203USA")
# => {"EM"=>"0008", "00"=>"001", "01"=>"TME001205IQ5", "02"=>"Blue Point Coastal Cuisine. INC.", "06"=>"565 5th Avenue", "08"=>"92101", "09"=>"SAN DIEGO", "10"=>"Downtown", "11"=>"CA", "12"=>"USA"}
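The recursive version is compact, but every call builds a new hash and merges it into the previous one. As a minimal sketch of the same decoding done in a single loop, something like the following should behave the same way on the sample lines. The name decode_iter is hypothetical, and it keeps the same assumptions (2-character key, 2-digit length). One difference to be aware of: String#[] counts characters, while the "a" directive of unpack counts bytes, so the two agree only for ASCII input; the question does not say which one the length field means.

```ruby
# Hypothetical loop-based variant of decode.
# Assumes every segment is: 2-character key, 2-digit length, then the value.
def decode_iter(data)
  result = {}
  pos = 0
  while pos < data.length
    key = data[pos, 2]                 # segment key, e.g. "EM" or "00"
    len = data[pos + 2, 2].to_i        # declared length of the value
    result[key] = data[pos + 4, len].to_s
    pos += 4 + len                     # jump to the start of the next segment
  end
  result
end

p decode_iter("EM0400080003001")
```

As with the recursive version, a key that appears twice on one line keeps only its last value.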

If you want to read the whole file and return a JSON array of objects, something like this will do:

#!/usr/bin/env ruby -n
BEGIN {
  require "json"
  def decode(data)
    # ...
  end
  arr = []
}

arr << decode($_.chomp)

END { puts arr.to_json }

Then (assuming the script is named script.rb and is executable):

$ cat data.txt | ./script.rb > out.json
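If you would rather not depend on the -n switch in the shebang line (on some systems, Linux for example, everything after the interpreter path is handed to env as a single argument), here is a rough sketch of the same idea as an ordinary script. The helper name records_to_json is hypothetical, and skipping blank lines is my own assumption rather than something the question asks for; decode is the function from earlier in this answer.

```ruby
require "json"

# decode as defined earlier in this answer.
def decode(data)
  return {} if data.empty?
  key, len, rest = data.unpack("a2 a2 a*")
  val = rest.slice!(0, len.to_i)
  { key => val }.merge(decode(rest))
end

# Hypothetical helper: accepts anything that responds to each_line
# (a File, ARGF, $stdin, or a String) and returns a JSON array string.
def records_to_json(io)
  records = io.each_line
              .map(&:chomp)        # drop the trailing newline
              .reject(&:empty?)    # assumption: blank lines carry no record
              .map { |line| decode(line) }
  JSON.generate(records)
end

# Uncomment to read the files named on the command line, or stdin if none:
# puts records_to_json(ARGF)
```

With the last line enabled, both ruby script.rb data.txt > out.json and cat data.txt | ruby script.rb > out.json should produce the same output.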

Answer 1: (score: 1)

Assuming keys are 2 characters and lengths are 2 digits:

line = "EM04000800030010112TME001205IQ50232Blue Point Coastal Cuisine. INC.06145655th Avenue0805921010909SAN DIEGO1008Downtown1102CA1203USA"



hsh = {}
arr = line.chars
until arr.empty?
  key = arr.shift(2).join
  length = arr.shift(2).join.to_i
  value = arr.shift(length).join
  hsh[key] = value
end
hsh

 => {"EM"=>"0008", "00"=>"001", "01"=>"TME001205IQ5", "02"=>"Blue Point Coastal Cuisine. INC.", "06"=>"5655th Avenue0", "80"=>"21010909SAN DIEGO1008Downtown1102CA1203USA"} 

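The result looks a bit funky: in this input the 06 segment declares 14 characters, but "5655th Avenue" is only 13, so the parser consumes one character too many and misreads everything after it.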
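A length that does not match its value makes every later segment misparse, as in the output above. A minimal sketch of a stricter loop, which raises instead of silently drifting, might look like this. The name decode_strict and the error messages are hypothetical, and the checks are only a heuristic: they catch a non-numeric length field or a value that runs past the end of the line, not every possible mismatch.

```ruby
# Hypothetical strict variant of the loop above.
# Assumes 2-character keys and 2-digit lengths, like the rest of this answer.
def decode_strict(line)
  result = {}
  pos = 0
  while pos < line.length
    key = line[pos, 2]
    len_field = line[pos + 2, 2].to_s
    # The length field must be exactly two digits.
    unless len_field.match?(/\A\d{2}\z/)
      raise ArgumentError, "bad length #{len_field.inspect} for key #{key.inspect} at offset #{pos}"
    end
    len = len_field.to_i
    value = line[pos + 4, len].to_s
    # The value must not be cut short by the end of the line.
    if value.length < len
      raise ArgumentError, "key #{key.inspect} declares #{len} characters but only #{value.length} remain"
    end
    result[key] = value
    pos += 4 + len
  end
  result
end

begin
  decode_strict("EM04000800030010112TME001205IQ50232Blue Point Coastal Cuisine. INC.06145655th Avenue0805921010909SAN DIEGO1008Downtown1102CA1203USA")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

On the line used in this answer the check fires at the bogus "80" key, whose declared length of 59 runs past the end of the line.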

Edit - to step through the file line by line:

# filename is a placeholder for the path to your data file
File.foreach(filename) do |line|
  # do stuff with line.chomp here
end