Question

How can I easily parse a document that has this structure?

description
some line of text
another line of text
more lines of text

quality
3 47 88 4 4 4  4

text: type 1
stats some funny stats

description
some line of text2
another line of text2
more lines of text2

quality
1 2  4 6 7

text: type 1
stats some funny stats

.
.
.

Ideally, I would like an array of hashes, where each hash represents one "section" of the document, probably looking like this:

{:description => "some line of text another line of text more lines of text", :quality => "3 47 88 4 4 4 4", :text => "type 1", :stats => "some funny stats"}

Answer 1

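You should loop over the document looking for the indicator lines (description, quality, text and stats) and fill in a hash as you process the document line by line.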

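Another option is to use regular expressions and parse the whole document at once, but you don't need regular expressions here, and if you're not familiar with them, I'd have to recommend against that approach.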

Update:

sections = []

File.open("deneme") do |f|
  current = {:description => "", :text => "", :quality => "", :stats => ""}
  inDescription = false
  inQuality = false

  f.each_line do |line|
    if inDescription
      if line.strip == ""
        inDescription = false
      else
        current[:description] += line
      end
    elsif inQuality
      current[:quality] = line.strip
      inQuality = false
    elsif line.strip == "description"
      inDescription = true
    elsif line.strip == "quality"
      inQuality = true
    elsif line.match(/^text: /)
      current[:text] = line[6..-1].strip
    elsif line.match(/^stats /)
      current[:stats] = line[6..-1].strip
      sections.push(current)
      current = {:description => "", :text => "", :quality => "", :stats => ""}
    end
  end
end
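
Note that the loop above appends each description line with its trailing newline, while the question asks for the lines joined into one string. A minimal post-processing sketch, assuming `sections` has the shape produced above (the sample value here is made up for illustration):

```ruby
# Sample value shaped like the output of the loop above (assumed, for illustration).
sections = [
  { :description => "some line of text\nanother line of text\nmore lines of text\n",
    :text => "type 1", :quality => "3 47 88 4 4 4  4", :stats => "some funny stats" }
]

# Join the description lines with single spaces and drop the trailing newline.
sections.each do |section|
  section[:description] = section[:description].lines.map(&:strip).join(' ')
end
```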

Answer 2

Regex version:

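# str is assumed to hold the whole document, e.g. str = File.read("deneme").
# The " *" and " +" keep the leading space out of the :text and :stats captures.
ary = str.scan(/description\n(.*?)\n\nquality\n(.*?)\n\ntext: *([^\n]+)\nstats +([^\n]+)/m).map do |desc, qual, text, stats|
  { :description => desc.gsub("\n", ' '), :quality => qual, :text => text, :stats => stats }
end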

Answer 3

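Your input looks pretty close to YAML, so I would transform the input into valid YAML (using something like Can's approach) and then load it with the standard Ruby libraries. Then, when your users run into something they didn't think of in their brilliant markup, tell them to just use YAML :)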
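
A rough sketch of what that conversion might look like, assuming the four markers from the question are the only structure in the file. The helper name `sections_to_yaml` is hypothetical; only `YAML.safe_load` and `String#to_json` come from the standard library:

```ruby
require 'yaml'
require 'json'

# Hypothetical helper: rewrite the custom format as a YAML sequence of mappings.
# Multi-line values become folded block scalars (">-"), so YAML joins their
# lines with spaces when loading.
def sections_to_yaml(input)
  out = []
  in_block = false
  input.each_line do |raw|
    line = raw.rstrip
    case line
    when 'description'
      out << '- description: >-' # a description starts a new section
      in_block = true
    when 'quality'
      out << '  quality: >-'
      in_block = true
    when /\Atext:\s*(.*)\z/
      out << "  text: #{$1.to_json}" # a JSON string is also a valid YAML scalar
      in_block = false
    when /\Astats\s+(.*)\z/
      out << "  stats: #{$1.to_json}"
      in_block = false
    when ''
      in_block = false # a blank line ends the current block
    else
      out << "    #{line.strip}" if in_block
    end
  end
  out.join("\n") + "\n"
end
```

Calling `YAML.safe_load(sections_to_yaml(File.read("deneme")))` should then give an array of hashes with string keys; `transform_keys(&:to_sym)` on each hash would convert them to symbols if needed.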

Answer 4

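One parsing trick is to read the data in paragraph mode, one chunk at a time. If your sub-sections are consistently delimited by 2 newlines (or if you can use pre-processing to impose that consistency), paragraph reading might be useful.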

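Aside from the special handling needed for the 'text' sub-section, the example below is fairly general, requiring only that you declare the name of the last sub-section.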

# Paragraph mode.
$/ = "\n\n"

last_subsection = 'text'
data = []

until DATA.eof
    data.push({})
    while true
        line = DATA.readline

        # Determine which sub-section we are in.
        ss = nil
        line.sub!( %r"\A(\w+):?\s*" ) { ss = $1; '' }

        # Special handling: split the 'text' item into 2 subsections.
        line, data[-1]['stats'] = line.split(/\nstats +/, 2) if ss == 'text'

        data[-1][ss] = line
        break if ss == last_subsection
    end

    # Cleanup newlines however you like.
    data[-1].each_key { |k| data[-1][k].gsub!(/\n/, ' ') }
end

# Check
data.each { |d| puts; d.each { |k,v| puts k + ' => ' + v } }

__END__
# Data not shown here...
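
As a variation on the same idea, paragraph mode can also be requested per call by passing an empty separator to `each_line`, which leaves the global `$/` untouched. A minimal sketch, where the helper name `parse_paragraphs` is hypothetical and the document is assumed to be already read into a string:

```ruby
# Hypothetical helper: parse the document paragraph by paragraph.
def parse_paragraphs(input, last_subsection = 'text')
  sections = []
  current = {}
  # An empty separator puts each_line into paragraph mode.
  input.each_line('') do |paragraph|
    match = paragraph.strip.match(/\A(\w+):?\s+(.*)\z/m)
    next unless match # skip anything that is not "name + body"

    name, body = match[1], match[2]
    # Same special case as above: 'text' and 'stats' share one paragraph.
    body, current['stats'] = body.split(/\nstats +/, 2) if name == 'text'
    current[name] = body.gsub("\n", ' ')

    if name == last_subsection
      sections << current
      current = {}
    end
  end
  sections
end
```

Something like `parse_paragraphs(File.read("deneme"))` would return one hash per section, keyed by sub-section name.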

Parsing structured text files in Ruby
