I'm currently trying to count how often each word length occurs in a file. The method looks like this:
def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path, "r") do |f|
    f.each_line do |line|
      line.split(" ").each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end
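It doesn't return the expected values. Can anyone tell me why, or point me to a better solution?
Answer 0 (score: 2)
Use String#scan, passing a regular expression that matches any run of word characters or ' characters: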
scan(/[\w\']+/)
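To see why this helps, here is a minimal sketch comparing split(" ") with scan on a single line. The sample line and variable names are made up for illustration.

```ruby
# Compare String#split and String#scan on a line that contains punctuation.
line = "thr thr, four. they've"

split_words = line.split(" ")      # punctuation stays attached to the words
scan_words  = line.scan(/[\w']+/)  # only word characters and apostrophes

p split_words               # => ["thr", "thr,", "four.", "they've"]
p scan_words                # => ["thr", "thr", "four", "they've"]

p split_words.map(&:length) # => [3, 4, 5, 7]
p scan_words.map(&:length)  # => [3, 3, 4, 7]
```

With split, "thr," counts as a four-character word and "four." as a five-character word, which is the kind of skew that produces unexpected counts.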
So your code would look like this:
#script.rb
def count_words_of_each_length_in_a_file(file_path)
  hash = {}
  File.open(file_path, "r") do |f|
    f.each_line do |line|
      line.scan(/[\w\']+/).each do |word|
        hash.key?(word.length) ? hash[word.length] += 1 : hash[word.length] = 1
      end
    end
  end
  hash
end
#test.rb
o
tw tw
thr thr, thr thr
four four. four four
they've they've
Then run your program:
count_words_of_each_length_in_a_file('./test.rb')
#=> {1=>1, 2=>2, 3=>4, 4=>4, 7=>2}
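Caveat: the solution above is a starting point, but it is not completely watertight. Consider hyphenated words, for example. What are the rules for handling words like that?
Answer 1 (score: 0)
As far as I can see, the only thing wrong with your code is that you don't remove punctuation first. We might want to remove the following characters: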
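As one possible rule, and only as an assumption about what might be wanted, a hyphenated word could be counted as a single word. The pattern below is a hypothetical sketch of that rule, not part of the answer above.

```ruby
# Hypothetical pattern: a run of word characters, optionally followed by
# groups made of one apostrophe or hyphen plus more word characters.
# A hyphen standing on its own is therefore not counted as a word.
word_pattern = /\w+(?:['-]\w+)*/

line  = "a well-known fact - they've said so"
words = line.scan(word_pattern)

p words               # => ["a", "well-known", "fact", "they've", "said", "so"]
p words.map(&:length) # => [1, 10, 4, 7, 4, 2]
```

Under this rule "well-known" has length 10 because the hyphen is counted. If it should not be, the hyphen would have to be stripped before measuring.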
BAD_CHARS = '.?!,:;"/-' # hyphen placed last so String#delete treats it literally, not as a range
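Add other characters as needed. The apostrophe/single-quote character is problematic. You may want to keep it for contractions ("don't") but remove it for possessives (e.g., "Rufus'" or "Sue's", the latter yielding the word "Sues", yet another problem) and for quoted strings ("She said, 'Get lost!'"). Telling these cases apart is of course difficult. For the purposes of this answer I will not remove apostrophes/single quotes.
I suggest you write your method as follows.
Code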
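One detail worth checking when building such a character list: in the argument to String#delete, a hyphen between two other characters denotes a range. The sketch below, with made-up variable names and sample text, shows how the hyphen's position decides whether apostrophes survive.

```ruby
# In the argument to String#delete, '"-/' means every character from '"'
# through '/', and that range includes the apostrophe. A hyphen placed last
# is taken as a literal hyphen instead.
text = "don't stop - it's well-known."

range_chars   = '.?!,:;"-/' # hyphen sits between '"' and '/': a range
literal_chars = '.?!,:;"/-' # hyphen comes last: a literal character

p text.delete(range_chars)   # => "dont stop  its wellknown"
p text.delete(literal_chars) # => "don't stop  it's wellknown"
```

Only the second form keeps contractions such as "don't" intact.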
def count_words_by_length(file_path)
  IO.foreach(file_path).with_object(Hash.new(0)) { |line, h|
    line.delete(BAD_CHARS).split.each { |word| h[word.length] += 1 } }
end
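The method relies on Hash.new(0), which returns 0 for a missing key, so h[word.length] += 1 works without first checking whether the key exists. A minimal sketch of the difference from a plain {} (variable names are illustrative):

```ruby
plain    = {}          # missing keys return nil
counting = Hash.new(0) # missing keys return 0

p plain[3]    # => nil
p counting[3] # => 0

# With a default of 0, incrementing an unseen key just works.
counting[3] += 1
counting[3] += 1
p counting[3] # => 2

# The same operation on the plain hash raises NoMethodError (nil + 1).
begin
  plain[3] += 1
rescue NoMethodError
  puts "plain hash needs an explicit key check"
end
```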
Example
str = "Let us wish the new President well,\neven if through gritted teeth."
FName = "test"
IO.write(FName, str)
#=> 66
count_words_by_length(FName)
#=> {3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}
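For comparison, the same counts can be obtained with Enumerable#tally. This is an alternative sketch rather than the method above: it assumes Ruby 2.7 or later, and the method name word_length_tally and its bad_chars parameter are hypothetical.

```ruby
require "tempfile"

# Hypothetical alternative; assumes Ruby 2.7+ for Enumerable#tally.
# bad_chars is passed to String#delete, so a hyphen should come last.
def word_length_tally(file_path, bad_chars)
  File.foreach(file_path)
      .flat_map { |line| line.delete(bad_chars).split } # all words in the file
      .map(&:length)                                    # their lengths
      .tally                                            # length => count
end

Tempfile.create("words") do |f|
  f.write("Let us wish the new President well,\neven if through gritted teeth.")
  f.flush
  counts = word_length_tally(f.path, '.?!,:;"/-')
  puts counts == { 3 => 3, 2 => 2, 4 => 3, 9 => 1, 7 => 2, 5 => 1 } # prints true
end
```

Unlike the line-by-line version, this builds an array of all words before counting, which is fine for small files but uses more memory on large ones.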
Explanation
Perhaps the best way to explain what is happening here is to insert some puts statements and rerun the code.
def count_words_by_length(file_path)
  enum0 = IO.foreach(file_path)
  puts "enum0=#{enum0}"
  enum1 = enum0.with_object(Hash.new(0))
  puts "enum1=#{enum1}"
  puts "enum1.to_a=#{enum1.to_a}" # Show elements to be generated by enumerator
  enum1.each do |line, h|
    puts "line=#{line}"
    puts " h=#{h}"
    str = line.delete(BAD_CHARS)
    puts " str=#{str}"
    arr = str.split
    puts " arr=#{arr}"
    arr.each do |word|
      h[word.length] += 1
      puts " word=#{word.ljust(9)} length=#{word.length} h=#{h}"
    end
  end
end
Then
count_words_by_length(FName)
prints the following.
enum0=#<Enumerator:0x007ff782138130>
enum1=#<Enumerator:0x007ff782138018>
enum1.to_a=[["Let us wish the new President well,\n", {}],
["even if through gritted teeth.", {}]]
line=Let us wish the new President well,
h={}
str=Let us wish the new President well
arr=["Let", "us", "wish", "the", "new", "President", "well"]
word=Let length=3 h={3=>1}
word=us length=2 h={3=>1, 2=>1}
word=wish length=4 h={3=>1, 2=>1, 4=>1}
word=the length=3 h={3=>2, 2=>1, 4=>1}
word=new length=3 h={3=>3, 2=>1, 4=>1}
word=President length=9 h={3=>3, 2=>1, 4=>1, 9=>1}
word=well length=4 h={3=>3, 2=>1, 4=>2, 9=>1}
line=even if through gritted teeth.
h={3=>3, 2=>1, 4=>2, 9=>1}
str=even if through gritted teeth
arr=["even", "if", "through", "gritted", "teeth"]
word=even length=4 h={3=>3, 2=>1, 4=>3, 9=>1}
word=if length=2 h={3=>3, 2=>2, 4=>3, 9=>1}
word=through length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>1}
word=gritted length=7 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2}
word=teeth length=5 h={3=>3, 2=>2, 4=>3, 9=>1, 7=>2, 5=>1}
IO.foreach and IO.write are often written as File.foreach and File.write. This is permitted because File is a subclass of IO (File < IO #=> true).
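A quick way to check this relationship yourself, as a sketch that uses a throwaway temporary file:

```ruby
require "tempfile"

p File < IO                   # => true
p File.ancestors.include?(IO) # => true

Tempfile.create("demo") do |f|
  File.write(f.path, "one\ntwo\n")
  # Both calls read the same lines, because File inherits foreach from IO.
  p File.foreach(f.path).to_a # => ["one\n", "two\n"]
  p IO.foreach(f.path).to_a   # => ["one\n", "two\n"]
end
```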