Counting occurrences of word lengths in a file using Ruby

Time: 2016-11-12 16:18:08

Tags: ruby

I'm currently trying to count how many times each word length occurs in a file. The method looks like this:

def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path,"r") do |f|
    f.each_line do |line|
      line.split(" ").each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end

It doesn't return the expected values. Can anyone tell me why, or point me to a better solution?

2 answers:

Answer 0: (score: 2)

Use String#scan with a regular expression that matches any run of word characters or ' characters:

scan(/[\w\']+/)
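To see what this changes, here is a small sketch comparing split(" ") with scan on a made-up line (the sample string is an assumption for illustration, not taken from the question). With split, punctuation stays attached to the word and inflates its length.

```ruby
# Hypothetical sample line, not taken from the question.
line = "thr thr, they've four."

# split(" ") keeps punctuation attached to the neighbouring word.
split_words = line.split(" ")
# scan collects each run of word characters or apostrophes.
scan_words = line.scan(/[\w']+/)

p split_words                # ["thr", "thr,", "they've", "four."]
p scan_words                 # ["thr", "thr", "they've", "four"]
p split_words.map(&:length)  # [3, 4, 7, 5]
p scan_words.map(&:length)   # [3, 3, 7, 4]
```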

So your code would look like this:

#script.rb

def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path,"r") do |f|
    f.each_line do |line|
      line.scan(/[\w\']+/).each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end

Example

#test.rb
o
tw tw
thr thr, thr thr
four four. four four
they've they've

Then run your program:

count_words_of_each_length_in_a_file('./test.rb')
#=> {1=>1, 2=>2, 3=>4, 4=>4, 7=>2}

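Warning: the solution above is a starting point, but it is not completely watertight. Consider hyphenated words, for example. What are the rules for handling words like those?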
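As a rough illustration of the hyphen question, the sketch below shows what the pattern above does with hyphenated words, next to one possible variant that keeps them whole. The sample text and the variant pattern are assumptions made for illustration; which behaviour is right depends on the rules you choose.

```ruby
# Hypothetical sample text.
text = "a well-known state-of-the-art fix"

# The pattern from the answer breaks hyphenated words at each hyphen.
plain = text.scan(/[\w']+/)
# One possible variant (an assumption): allow single hyphens inside a word.
hyphenated = text.scan(/[\w']+(?:-[\w']+)*/)

p plain       # ["a", "well", "known", "state", "of", "the", "art", "fix"]
p hyphenated  # ["a", "well-known", "state-of-the-art", "fix"]
```

With the variant, "well-known" counts as one 10-character word instead of two shorter ones.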

Answer 1: (score: 0)

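As I see it, the only thing wrong with your code is that you don't remove punctuation first. We might want to remove the following characters: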

BAD_CHARS = '.?!,:;"/-'  # hyphen last, so String#delete reads it literally rather than as a range

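Add other characters as needed. The apostrophe/single-quote character is a problem. You may want to keep it for contractions ("don't") but remove it for possessives (e.g., "Rufus'" or "Sue's", the latter producing the word "Sues", yet another problem) and for quoted strings ("She said, 'Get lost!'"). Distinguishing between these cases is of course difficult. For the purposes of this answer I will not remove apostrophes/single quotes.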
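When assembling a character set such as BAD_CHARS, it helps to know how String#delete reads a hyphen: between two characters it denotes a range, while at the end of the set it is a literal hyphen. The sketch below uses a made-up string (an assumption for illustration) to show the difference, including the effect on apostrophes.

```ruby
# Hypothetical input for illustration.
sample = %q{don't stop-gap (maybe) "quoted"}

# A hyphen between two characters is read as a range: here '"' through '/',
# which also covers #, $, %, &, ', (, ), *, +, comma, hyphen and period.
ranged = sample.delete('"-/')
# Placing the hyphen last makes it a literal hyphen.
literal = sample.delete('"/-')

p ranged   # "dont stopgap maybe quoted"
p literal  # "don't stopgap (maybe) quoted"
```

Note that the ranged form removes the apostrophe in "don't", which matters if contractions are meant to be kept.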

I suggest you write the method as follows.

Code

def count_words_by_length(file_path)
  IO.foreach(file_path).with_object(Hash.new(0)) { |line, h|
    line.delete(BAD_CHARS).split.each { |word| h[word.length] += 1 } }
end

Example

str = "Let us wish the new President well,\neven if through gritted teeth."    
FName = "test"
IO.write(FName, str)
  #=> 66

count_words_by_length(FName)
  #=> {3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}

Explanation

Perhaps the best way to explain what is happening here is to insert some puts statements and rerun the code.

def count_words_by_length(file_path)
  enum0 = IO.foreach(file_path)
  puts "enum0=#{enum0}"
  enum1 = enum0.with_object(Hash.new(0))
  puts "enum1=#{enum1}"
  puts "enum1.to_a=#{enum1.to_a}" # Show elements to be generated by enumerator
  enum1.each do |line, h|
    puts "line=#{line}"
    puts "  h=#{h}"
    str = line.delete(BAD_CHARS)
    puts "  str=#{str}"
    arr = str.split
    puts "  arr=#{arr}"
    arr.each do |word|
      h[word.length] += 1
      puts "    word=#{word.ljust(9)} length=#{word.length} h=#{h}" 
    end
  end  
end

Then

count_words_by_length(FName)

prints the following.

enum0=#<Enumerator:0x007ff782138130>
enum1=#<Enumerator:0x007ff782138018>
enum1.to_a=[["Let us wish the new President well,\n", {}],
            ["even if through gritted teeth.", {}]]
line=Let us wish the new President well,
  h={}
  str=Let us wish the new President well
  arr=["Let", "us", "wish", "the", "new", "President", "well"]
    word=Let       length=3 h={3=>1}
    word=us        length=2 h={3=>1, 2=>1}
    word=wish      length=4 h={3=>1, 2=>1, 4=>1}
    word=the       length=3 h={3=>2, 2=>1, 4=>1}
    word=new       length=3 h={3=>3, 2=>1, 4=>1}
    word=President length=9 h={3=>3, 2=>1, 4=>1, 9=>1}
    word=well      length=4 h={3=>3, 2=>1, 4=>2, 9=>1}
line=even if through gritted teeth.
  h={3=>3, 2=>1, 4=>2, 9=>1}
  str=even if through gritted teeth
  arr=["even", "if", "through", "gritted", "teeth"]
    word=even      length=4 h={3=>3, 2=>1, 4=>3, 9=>1}
    word=if        length=2 h={3=>3, 2=>2, 4=>3, 9=>1}
    word=through   length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>1}
    word=gritted   length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2}
    word=teeth     length=5 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}

IO.foreach and IO.write are often written as File.foreach and File.write. This is permissible because File is a subclass of IO (File < IO #=> true).
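A minimal sketch of the same approach written with File.foreach, assuming a throwaway temporary file as input. The helper name word_length_counts and the sample text are hypothetical, and punctuation removal is left out to keep the example short.

```ruby
require "tempfile"

# Hypothetical helper name; counts words by length using File.foreach.
def word_length_counts(file_path)
  File.foreach(file_path).with_object(Hash.new(0)) do |line, counts|
    line.split.each { |word| counts[word.length] += 1 }
  end
end

result = nil
# Tempfile.create removes the file once the block finishes.
Tempfile.create("words") do |f|
  f.write("one two\nthree")
  f.flush
  result = word_length_counts(f.path)
end

p(File < IO)                        # true
p(result == { 3 => 2, 5 => 1 })     # true
```

Hash.new(0) supplies 0 for any length not seen yet, which is why no key? check is needed before incrementing.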