Question

I'm currently trying to count how often each word length occurs in a file. The method looks like this:

def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path,"r") do |f|
    f.each_line do |line|
      line.split(" ").each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end

It doesn't return the expected values. Can someone tell me why, or point me to a better solution?

Answer 1

Use String#scan with a regular expression that matches any run of word characters or apostrophes:

scan(/[\w\']+/)
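To see why this helps, here is a small comparison of split and scan on one line of the sample input used further down. The variable names are only for illustration:

```ruby
line = "thr thr, thr thr\n"

# split(" ") breaks on whitespace only, so the comma stays attached
# to the second word and that word is counted with length 4.
split_words = line.split(" ")
#=> ["thr", "thr,", "thr", "thr"]

# scan keeps only runs of word characters and apostrophes,
# so the comma is dropped.
scan_words = line.scan(/[\w']+/)
#=> ["thr", "thr", "thr", "thr"]
```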

So your code would look like this:

#script.rb

def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path,"r") do |f|
    f.each_line do |line|
      line.scan(/[\w\']+/).each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end

Example

#test.rb
o
tw tw
thr thr, thr thr
four four. four four
they've they've

Then run your program:

count_words_of_each_length_in_a_file('./test.rb')
#=> {1=>1, 2=>2, 3=>4, 4=>4, 7=>2}

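Caveat: the solution above is a starting point, but it is not watertight. For example, consider hyphenated words. What are the rules for handling those kinds of words?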
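One possible rule, assumed here purely for illustration, is to count a hyphenated word as a single word whose length includes the hyphen. The sketch below uses a hypothetical helper, word_lengths, and a hypothetical pattern, WORD, that accepts an apostrophe or hyphen only between word characters:

```ruby
# Assumed rule: an apostrophe or hyphen belongs to a word only when it
# sits between word characters, so "well-known" and "they've" each
# count as one word, while a free-standing " - " is ignored.
WORD = /\w+(?:['-]\w+)*/

# Hypothetical helper: maps each word length to its number of occurrences.
def word_lengths(text)
  counts = Hash.new(0)
  text.scan(WORD) { |word| counts[word.length] += 1 }
  counts
end

word_lengths("a well-known rule - they've said")
#=> {1=>1, 10=>1, 4=>2, 7=>1}
```

Whether the hyphen should count toward the length, or whether "well-known" should instead be two words, depends on what the counts are used for.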

Answer 2

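As far as I can see, the only thing wrong with your code is that you don't remove punctuation first. We might want to remove the following characters: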

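# The hyphen goes last so that String#delete treats it as a literal
# character rather than as a range ('"-/' would also delete apostrophes).
BAD_CHARS = '.?!,:;"/-'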

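Add other characters as needed. The apostrophe/single-quote character is problematic. You may want to keep it for contractions ("don't") but remove it for possessives (e.g., "Rufus'" or "Sue's", the latter yielding the word "Sues", yet another problem) and for quoted strings ("She said, 'Get lost!'"). Distinguishing these cases is of course difficult. For the purposes of this answer, I will not remove apostrophes/single quotes.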
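As a rough illustration of one partial compromise, the sketch below keeps apostrophes inside a word and strips them only at the edges. The helper name strip_edge_apostrophes is hypothetical and is not used by the method that follows:

```ruby
# Assumed rule: keep an apostrophe inside a word ("don't"), but strip
# apostrophes at either edge of a word ("'Get", "Rufus'", "lost'").
def strip_edge_apostrophes(word)
  word.gsub(/\A'+|'+\z/, "")
end

%w[don't Rufus' 'Get lost'].map { |w| strip_edge_apostrophes(w) }
#=> ["don't", "Rufus", "Get", "lost"]
```

This still cannot tell the possessive "Sue's" from a contraction, so "Sue's" keeps its apostrophe and counts as five characters.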

I suggest you write your method as follows.

Code

def count_words_by_length(file_path)
  IO.foreach(file_path).with_object(Hash.new(0)) { |line, h|
    line.delete(BAD_CHARS).split.each { |word| h[word.length] += 1 } }
end

Example

str = "Let us wish the new President well,\neven if through gritted teeth."
FName = "test"
IO.write(FName, str)
  #=> 66
count_words_by_length(FName)
  #=> {3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}

Explanation

Perhaps the best way to explain what is happening here is to insert some puts statements and rerun the code.

def count_words_by_length(file_path)
  enum0 = IO.foreach(file_path)
  puts "enum0=#{enum0}"
  enum1 = enum0.with_object(Hash.new(0))
  puts "enum1=#{enum1}"
  puts "enum1.to_a=#{enum1.to_a}" # Show elements to be generated by enumerator
  enum1.each do |line, h|
    puts "line=#{line}"
    puts " h=#{h}"
    str = line.delete(BAD_CHARS)
    puts " str=#{str}"
    arr = str.split
    puts " arr=#{arr}"
    arr.each do |word|
      h[word.length] += 1
      puts " word=#{word.ljust(9)} length=#{word.length} h=#{h}"
    end
  end
end

Then

count_words_by_length(FName)

prints the following.

enum0=#<Enumerator:0x007ff782138130>
enum1=#<Enumerator:0x007ff782138018>
enum1.to_a=[["Let us wish the new President well,\n", {}], ["even if through gritted teeth.", {}]]
line=Let us wish the new President well,
 h={}
 str=Let us wish the new President well
 arr=["Let", "us", "wish", "the", "new", "President", "well"]
 word=Let       length=3 h={3=>1}
 word=us        length=2 h={3=>1, 2=>1}
 word=wish      length=4 h={3=>1, 2=>1, 4=>1}
 word=the       length=3 h={3=>2, 2=>1, 4=>1}
 word=new       length=3 h={3=>3, 2=>1, 4=>1}
 word=President length=9 h={3=>3, 2=>1, 4=>1, 9=>1}
 word=well      length=4 h={3=>3, 2=>1, 4=>2, 9=>1}
line=even if through gritted teeth.
 h={3=>3, 2=>1, 4=>2, 9=>1}
 str=even if through gritted teeth
 arr=["even", "if", "through", "gritted", "teeth"]
 word=even      length=4 h={3=>3, 2=>1, 4=>3, 9=>1}
 word=if        length=2 h={3=>3, 2=>2, 4=>3, 9=>1}
 word=through   length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>1}
 word=gritted   length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2}
 word=teeth     length=5 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}

IO.foreach and IO.write are usually written as File.foreach and File.write. This is permitted because File is a subclass of IO (File < IO #=> true).
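For comparison, here is a minimal sketch of the same approach written with File.foreach. The method name count_words_by_length_with_file and the constant PUNCTUATION are stand-ins chosen for this illustration, and a temporary file replaces the "test" file from the example:

```ruby
require "tempfile"

# Stand-in for the punctuation set used above. The hyphen is last so
# String#delete reads it literally rather than as a range.
PUNCTUATION = '.?!,:;"/-'

# Hypothetical File-based variant: with_object returns the hash it was
# given, so the method returns the finished counts.
def count_words_by_length_with_file(file_path)
  File.foreach(file_path).with_object(Hash.new(0)) do |line, h|
    line.delete(PUNCTUATION).split.each { |word| h[word.length] += 1 }
  end
end

result = nil
Tempfile.create("words") do |f|
  f.write("Let us wish the new President well,\neven if through gritted teeth.")
  f.flush # make sure the text is on disk before reading it back
  result = count_words_by_length_with_file(f.path)
end
result
#=> {3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}
```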
