Counting the lines of a poem with nokogiri / ruby

Asked: 2011-06-06 14:39:23

Tags: nokogiri

I've been struggling to do this with a simple regex, but it has never been very accurate. It doesn't have to be perfect.

The source contains a combination of <p> and <br> tags. I don't want to count blank lines.

Old way:

  self.words = rendered.gsub(/<p>&nbsp;<\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1

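New way (not working): try to convert all the consecutive <br> tags into paragraphs, then throw the result into nokogiri to count the paragraph tags that contain more than 3 characters (I don't know how to count that; counting 1-letter lines would be nice too). The second snippet is the same idea in JavaScript.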

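  h = rendered
  # Ruby regex literals take no /g flag; gsub! already replaces every match.
  h.gsub!(/<br>\s*<br>/i, "<p>")
  h.gsub!(/<br>/i, "<p>") if h =~ /<br>\s*<br>/
  # `!h =~ ...` negates h before matching, so use `unless` instead.
  h.prepend("<p>") unless h =~ /\A\s*<p[^>]*>/i
  h.gsub!(/<p>\s*<p>/, "<p>&nbsp;</p><p>")
  doc = Nokogiri::HTML(h)
  # find+count p tags with at least 1-3 chars?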

  # this is javascript not ruby, but you get the idea
  $('p', c).each(function(i) { // had to trim it to remove whitespaces from start/end.
    if ($(this).children('img').length) return; // skip if it's just an image.
    if ($.trim($(this).text()).length > 3)
      $(this).append("<div class='num'>"+ (n += 1) +"</div>");
  })

Other approaches are welcome!

Sample poem (http://allpoetry.com/poem/7429983-the_many_endings-by-Kevin):

<p>
    from the other side of silence<br>
    you met me with change and a pocket<br>
    of unhappy apples.</p>
<p>
     </p>
<p>
    <br>
    we bled together to black<br>
    and chose the path carefully to<br>
    france.<br><br>
    sometimes when you smile<br>
    your radiant footsteps fall<br>
    and all around us is silence:<br>
    each dream step is<br>
    false but full of such glory</p>
<p>
     </p>
<p>
    <br>
    unhappiness never made a student of you:<br>
    just two by two by two.  now three<br>
    this great we that overflows our<br>
    heart-cave<br><br>
    each jewel-like addition to the delicate<br>
    crown.  but flowers fall and dreams,<br>
    all dreams, come to and end with death.</p> 

Thanks!

1 answer:

Answer 0 (score: 0)

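For posterity, this is what I'm using now, and it seems to be quite accurate. Non-Latin characters coming from ckeditor sometimes caused problems, so for now I'm stripping them out.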

  html = Nokogiri::HTML(rendered)
  text = html.at('body').inner_text rescue nil
  return self.words = rendered.gsub(/<p>&nbsp;<\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1 if !text

  #bonus points to strip lines entirely non-letter. idk

  #d "text is", text.gsub!(/([\x09|\x0D|\t])|(\xc2\xa0){1,}|[^A-z]/u,'')
  text.gsub!(/[^A-Za-z\n]/u,'') # A-z would also keep [ \ ] ^ _ and `
  #d "text is", text
  self.words = text.strip.scan(/(\s*\n\s*)+/).size+1
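
To make the counting step easier to follow, here is a small standalone sketch of it in plain Ruby. The function name `count_poem_lines` is invented for illustration. Unlike the snippet above, it returns 0 for text with no letters at all and limits the character class to `A-Za-z`; both are assumptions about what is wanted.

```ruby
# Hypothetical standalone version of the counting step (name is illustrative).
# Expects the plain text of the poem, with one "\n" per source line.
def count_poem_lines(text)
  # Keep only ASCII letters and newlines, so lines holding just spaces,
  # punctuation or non-breaking spaces become empty.
  letters = text.gsub(/[^A-Za-z\n]/, '').strip
  return 0 if letters.empty?
  # Each run of newlines separates two non-empty lines.
  letters.scan(/\n+/).size + 1
end

sample = "\n  from the other side of silence\n  you met me with change\n" \
         "  of unhappy apples.\n\n \u00A0\n\n  we bled together to black\n"
puts count_poem_lines(sample)
```

This counts newline characters that are already in the text, because `inner_text` does not turn a <br> into a newline. It therefore assumes the markup puts each line of the poem on its own source line, as the sample poem does.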