Ruby / Rails: parsing text into objects

Time: 2014-04-12 02:57:21

Tags: ruby-on-rails ruby regex parsing object

I'm trying to create an object from each repeating group in the text below (an .srt subtitle file):

1
00:02:12,446 --> 00:02:14,406
The Hovitos are near.

2
00:02:15,740 --> 00:02:18,076
The poison is still fresh,
three days.

3
00:02:18,076 --> 00:02:19,744
They're following us.

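For example, I could take each group of three or four lines and assign them to the attributes of a new object. So for the first group, I could call Sentence.create(number: 1, time_marker: '00:02:12', content: "The Hovitos are near.")

Starting from script.each_line, what general structure might get me on the right track? I'm having a hard time with this, and any help would be great!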

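Edit

Below is the messy, unfinished code I've come up with so far. It does work (I think). Would you take a completely different route? I have no experience with this.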

number = nil
time_marker = nil
content = []

script = script.strip
script.each_line do |line|
  line = line.strip
  if line =~ /^\d+$/
    number = line.to_i
  elsif line =~ /-->/
    time_marker = line[0..7]
  elsif line =~ /^\b\D/
    content << line
  else
    if content.size > 1
      content = content.join("\n") 
    else
      content = content[0]
    end

    Sentence.create(movie: @movie, number: number, 
      time_marker: time_marker, content: content)
    content = []
  end
end

2 Answers:

Answer 0 (score: 1)

Assuming the subtitles are in the following variable:

subtitles = %q{1
00:02:12,446 --> 00:02:14,406
The Hovitos are near.

2
00:02:15,740 --> 00:02:18,076
The poison is still fresh,
three days.

3
00:02:18,076 --> 00:02:19,744
They're following us.}

Then you can do this:

def split_subs(subtitles)
  grouped, entries = [], []
  # Append a blank line so the final group is flushed as well.
  subtitles.split("\n").push("\n").each do |line|
    if line.strip.empty?
      next if grouped.empty? # ignore repeated blank lines
      entries.push({
        number: grouped[0],
        time_marker: grouped[1].split(",").first,
        content: grouped[2..-1].join(" ")
      })
      grouped = []
    else
      grouped.push line.strip
    end
  end
  entries
end

puts split_subs(subtitles)

# output:
# $ ruby 23025546.rb
# {:number=>"1", :time_marker=>"00:02:12", :content=>"The Hovitos are near."}
# {:number=>"2", :time_marker=>"00:02:15", :content=>"The poison is still fresh, three days."}
# {:number=>"3", :time_marker=>"00:02:18", :content=>"They're following us."}
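
The hashes above still hold the number as a string, while the question asks for objects with an integer number. The following is a minimal sketch of that last step, not part of the original answer. It assumes a plain Struct as a stand-in for the Rails `Sentence` model, and the `parsed` array simply repeats the output shown above.

```ruby
# Hypothetical stand-in for the Rails `Sentence` model from the question.
Sentence = Struct.new(:number, :time_marker, :content, keyword_init: true)

# Hashes in the shape returned by split_subs (copied from the output above).
parsed = [
  { number: "1", time_marker: "00:02:12", content: "The Hovitos are near." },
  { number: "2", time_marker: "00:02:15", content: "The poison is still fresh, three days." },
  { number: "3", time_marker: "00:02:18", content: "They're following us." }
]

# Convert the number to an Integer and build one object per hash.
sentences = parsed.map do |attrs|
  Sentence.new(**attrs.merge(number: attrs[:number].to_i))
end

sentences.each { |s| puts "#{s.number}: #{s.content}" }
```

In a Rails app the same loop would presumably call `Sentence.create(attrs.merge(movie: @movie))` instead of `Sentence.new`, as in the question's own code.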

Answer 1 (score: 1)

Here is one way to do it:

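# Read the whole file and split it on blank lines, one entry per subtitle.
File.read('subtitles.srt').split(/^\s*$/).each do |entry|
  sentence = entry.strip.split("\n")
  next if sentence.empty? # skip chunks that contain only whitespace
  number = sentence[0]                 # first line of the entry is 'number'
  time_marker = sentence[1][0..7]      # first 8 characters of the second line: HH:MM:SS
  content = sentence[2..-1].join("\n") # everything after that is 'content'
  # e.g. Sentence.create(number: number, time_marker: time_marker, content: content)
end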
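
As an alternative sketch (not the answerer's code), each entry can be checked against a regular expression before an object is built, so malformed entries are skipped instead of raising. The names `CUE` and `parse_srt` are hypothetical, the Struct is a stand-in for the Rails `Sentence` model, and the pattern assumes timing lines of the form `HH:MM:SS,mmm --> ...` as in the sample.

```ruby
# Hypothetical stand-in for the Rails `Sentence` model from the question.
Sentence = Struct.new(:number, :time_marker, :content, keyword_init: true)

# One entry: an index line, a timing line, then one or more lines of text.
CUE = /\A(?<number>\d+)\n(?<start>\d{2}:\d{2}:\d{2}),\d{3} --> [^\n]+\n(?<content>.+)\z/m

def parse_srt(text)
  # Normalize Windows line endings, then split on one or more blank lines.
  blocks = text.gsub("\r\n", "\n").strip.split(/\n\s*\n/)
  blocks.map do |block|
    match = CUE.match(block.strip)
    next unless match # skip anything that does not look like a subtitle entry

    Sentence.new(
      number: match[:number].to_i,
      time_marker: match[:start],
      content: match[:content]
    )
  end.compact
end

srt = <<~SRT
  1
  00:02:12,446 --> 00:02:14,406
  The Hovitos are near.

  2
  00:02:15,740 --> 00:02:18,076
  The poison is still fresh,
  three days.
SRT

parse_srt(srt).each { |s| puts "#{s.number} | #{s.time_marker} | #{s.content.inspect}" }
```

Splitting on blank lines first keeps the pattern small, and multi-line subtitles stay in a single `content` string with their line breaks preserved.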