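Ruby - Parsing a text file efficiently

Time: 2014-03-03 12:24:23

Tags: html ruby string parsing

I am trying to parse the hrefs in some HTML code. Basically, I am trying to get the URL and the description of each link. I also split the description on whitespace and count how many times each word occurs, and finally write the results to two separate files. My parser works, but it is very inefficient: I would say it takes about 2 minutes to parse 1 MB of text.

Here is my code: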

hrefTag = "<a href=\""
qtMark = "\""
descStart = "\">"
hrefEnd = "</a>"
if line.include? hrefTag 
    dest = line[/#{hrefTag}(.*?)#{qtMark}/m, 1]
    descStIn = line.rindex(descStart)
    descEndIn = line.rindex(hrefEnd)
    if (descStIn != nil && descEndIn != nil)
        desc = line[(descStIn+2)..(descEndIn-1)]
    end
end
if (source != "" && dest != "")
    occur = Hash.new(0)
    mainEntry = "original-url=\"" + source + 
    "\", dest-url=\"" + dest + "\"" 
    descEntry = ""
    if (desc != nil && desc != "")
        descEntry = ", desc=\"" + desc + "\""
        words = desc.split(' ')
        words.each { |word| occur[word] += 1 }
    end
    firstEntry = mainEntry+descEntry+"\n\n"
    File.open(firstOutput, 'a') { |file| 
        file.write(firstEntry) 
    }
    occur.each { |word, occurrance| 
        wordEntry = ", word=\"" + word +
        "\", count=" + occurrance.to_s
        secondEntry = mainEntry+wordEntry+"\n\n"
        File.open(secondOutput, 'a') { |file| 
            file.write(secondEntry) 
        }
    }
end

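How can I make this more efficient? Which parts are the least efficient?

1 Answer:

Answer 0: (score: 0)

To find out what takes the most time, profile your code with ruby-prof or a similar tool. Install ruby-prof: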

gem install ruby-prof

Run it against your script:

ruby-prof <script.rb>

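When your script finishes (or you press CTRL-C), it prints a summary of the method calls, the time spent in each method, and so on. Here is a snippet of the output: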

Sort by: self_time

 %self      total      self      wait     child     calls  name
  8.67      0.008     0.008     0.000     0.000        2   JSON::Ext::Parser#parse 
  8.45      0.022     0.008     0.000     0.014       99   IO#read_nonblock 
  6.66      0.006     0.006     0.000     0.000       99   <Module::Kernel>#select 
  2.78      0.003     0.003     0.000     0.000      235   IO#write 
  1.17      0.001     0.001     0.000     0.000       57   Enumerator#next 
  0.99      0.049     0.001     0.000     0.048      207  *Array#each
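
If installing a profiler is not an option, the standard library's Benchmark module can time individual suspects. The sketch below is only an illustration, not a measured diagnosis of the code in the question. It assumes simple, well-formed anchors of the form `<a href="URL">text</a>`, and the names `LINK_RE`, `extract_links`, `count_words`, `write_reopening` and `write_once` are hypothetical helpers introduced here. It replaces the `include?` / `rindex` / slicing steps with one regex pass, and compares reopening the output file for every entry (as the question does) against writing through a single handle.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical pattern: captures the href value and the link text.
# Assumes simple anchors; real-world HTML is better served by an HTML parser.
LINK_RE = /<a href="(.*?)"[^>]*>(.*?)<\/a>/m

# Return every [dest, desc] pair found in the line in a single scan.
# Unlike the question's code, this handles several links on one line.
def extract_links(line)
  line.scan(LINK_RE)
end

# Count whitespace-separated words, like the question's `occur` hash.
def count_words(desc)
  counts = Hash.new(0)
  desc.split.each { |word| counts[word] += 1 }
  counts
end

# The question's pattern: open and close the file for each entry.
def write_reopening(path, entries)
  entries.each { |entry| File.open(path, 'a') { |file| file.write(entry) } }
end

# Alternative: open the file once and write all entries through one handle.
def write_once(path, entries)
  File.open(path, 'a') { |file| entries.each { |entry| file.write(entry) } }
end

# Time both write strategies on the same synthetic entries.
entries = Array.new(2_000) { |i| "original-url=\"s\", dest-url=\"d#{i}\"\n\n" }
Tempfile.create('reopen') do |tmp|
  puts "reopen per entry: #{Benchmark.realtime { write_reopening(tmp.path, entries) }} s"
end
Tempfile.create('once') do |tmp|
  puts "single handle:    #{Benchmark.realtime { write_once(tmp.path, entries) }} s"
end
```

The timings depend on the machine and file system, so treat them as a hint about where to look and confirm with a profiler run on the real input before restructuring the script.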