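I have a method that counts the frequency of words in a string. I manually list some words that should be removed. I've found that for short strings 'the' is removed, but for longer strings, like the one below, the method still prints 'the'. Any ideas why this happens and how to fix it?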
def count_words(string)
  words = string.downcase.split(' ')
  delete_list = ['the']
  delete_list.each do |del|
    words.delete_at(words.index(del))
  end
  frequency = Hash.new(0)
  words.each do |word|
    frequency[word.downcase] += 1
  end
  return frequency.sort_by {|k,v| v}.reverse
end
puts count_words('Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
sales metrics often do not reflect the contributions of the role, which demonstrates that line management is out of touch of what the individual contributors role really does
middle management does not care about the career of his/her directs, 90% of the time management competes directly with their people, or takes credit for their work
lots of back stabbing going on
Microsoft changes the organization or commitment or comp model, faster than the average deal cycle, making it next to near impossible to develop momentum in role or a rhythm of success
execs promote themselves in years when they freeze employees merit increases
only way to advance is to step on your peers/colleagues and take credit for work you had no impact on, beat your chest loud enough and you get "visibility" you need to advance
visibility is not based on performance by enlarge, it is based on being in your manager\'s swim lane for advancement
I have observed people get promoted in years when they did not meet their quota, nor did the earn the highest performance on the team, they kissed their way to the promotion
Advice to Senior Management 1, get back to risk taking and teaming, less politics please, you are killing the company
2, set realistic commitments and stick to them for multiple years, stop changing the game faster than your people can react
3, stop over engineering commitments and over segmenting the company, people are not willing to collaborate or be corporate citizens
4, too many empty suits in middle management, keep flattening out the company and getting rid of middle managers that run reports all day, get back to a culture where managers also sell and drives wins
5, keep your word microsoft, you said stability, but you keep tinkering with the org too much for any changes to take affect A great Culture
Limitless opportunities
Supportive Management team who are passionate about people
A company that really does want you to have a good work life balance and backs it up with policies that enable you to manage how and where you work.
Cons Support resources are constrained
Can be overly competitve and hard to get noticed
Sales rewards are definitely prioritised and marketing cuts are always prioritised.
Consumer organisation is still far from ideal.
Advice to Senior Management Focus on getting the internal organisation simplified to improve performance and increase empowerment.
Get some REAL consumer focus and invest for the long term
Start connecting with people, focussing on telling stories rather than selling products.')
Answer 0 (score: 1)
Just use words.delete("the"). All you need to do is give it the key.
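The question's code still prints 'the' because words.index(del) returns only the first matching position, so delete_at removes a single occurrence per stop word. Array#delete, by contrast, removes every equal element. A minimal sketch of the difference, using made-up sample words rather than the question's text:

```ruby
# Hypothetical sample data, not taken from the question.
words = %w[the cat saw the dog near the barn]

# Question's approach: index finds only the first match, so exactly one
# 'the' is removed.
first_only = words.dup
first_only.delete_at(first_only.index('the'))

# Array#delete removes every element equal to its argument.
all_removed = words.dup
all_removed.delete('the')

p first_only   # => ["cat", "saw", "the", "dog", "near", "the", "barn"]
p all_removed  # => ["cat", "saw", "dog", "near", "barn"]
```

Note also that if a stop word is absent, index returns nil and delete_at(nil) raises a TypeError, whereas delete simply returns nil.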
A simpler version of your program would be:
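def count_words(string)
  # Build a hash of word => count in a single pass
  words = string.downcase.split(' ').each_with_object(Hash.new(0)) { |w,o| o[w] += 1 }
  delete_list = ['the']
  # Hash#delete removes the key, and with it the whole count for that word
  delete_list.each { |del| words.delete(del) }
  # Sort by count, highest first
  words.sort_by {|k,v| v}.reverse
end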
Answer 1 (score: 1)
This is a very common problem when analyzing web pages for SEO. Here's a quick version I wrote:
require 'pp'

STOP_WORDS = %w[a and of the]

def count_words(string)
  word_count = string
    .downcase
    .gsub(/[^a-z ]+/, '')
    .split
    .group_by{ |w| w }

  STOP_WORDS.each do |stop_word|
    word_count.delete(stop_word)
  end

  word_count
    .map{ |k,v| [k, v.size]}
    .sort_by{ |k, c| [-c, k] }
end

pp count_words(<<EOT)
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.
EOT
I deliberately truncated the sample data to make it easier to read.
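On that topic, you can use a heredoc ("<<") to improve the formatting of your code when you have to pass in a large amount of text. An alternative is to insert an __END__ marker, put all of the text after it, and then use the special IO object DATA to read that trailing block: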
pp count_words(DATA.read)
__END__
Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking
politics have gotten out of control since gates left the building..
Start connecting with people, focussing on telling stories rather than selling products.
In either case, the code outputs:
[["are", 1], ["benefits", 1], ["buildingstart", 1], ["compensation", 1], ["connecting", 1], ["cons", 1], ["control", 1], ["empty", 1], ["fair", 1], ["focussing", 1], ["gates", 1], ["gotten", 1], ["great", 1], ["have", 1], ["left", 1], ["little", 1], ["management", 1], ["middle", 1], ["off", 1], ["on", 1], ["out", 1], ["people", 1], ["products", 1], ["pros", 1], ["rather", 1], ["reasonable", 1], ["risk", 1], ["selling", 1], ["since", 1], ["stories", 1], ["suits", 1], ["takingpolitics", 1], ["telling", 1], ["than", 1], ["time", 1], ["very", 1], ["vision", 1], ["void", 1], ["with", 1]]
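gsub(/[^a-z ]+/, '') strips anything that isn't a lowercase letter or a space. Enumerable's group_by is doing the heavy lifting. Also, Enumerable's sort_by makes it easy to sort by count in descending order and then alphabetically by word.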
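One side effect of that pattern is that newlines are stripped too, which is why the last word of one line and the first word of the next are fused into "takingpolitics" and "buildingstart" in the output. A possible variant, offered as a sketch and not as this answer's own code, keeps all whitespace by using \s in the character class. The method name count_words_keeping_lines and the stop_words keyword are made up for illustration:

```ruby
# Hypothetical variant: \s keeps spaces, tabs and newlines, so words on
# adjacent lines stay separate when split runs.
def count_words_keeping_lines(string, stop_words: %w[a and of the])
  counts = string
    .downcase
    .gsub(/[^a-z\s]+/, '')
    .split
    .group_by { |w| w }

  # Remove each stop word's group by key.
  stop_words.each { |stop_word| counts.delete(stop_word) }

  # Turn each group into [word, count], highest count first, then A to Z.
  counts
    .map { |word, occurrences| [word, occurrences.size] }
    .sort_by { |word, count| [-count, word] }
end

p count_words_keeping_lines("risk taking\npolitics of the building..\nStart taking")
# => [["taking", 2], ["building", 1], ["politics", 1], ["risk", 1], ["start", 1]]
```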
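I use a hash rather than an array when removing the stop words, because iterating over the STOP_WORDS list is usually faster than iterating over the words in the corpus. A large corpus will most likely contain far more words than there are stop words.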
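The trade-off can be sketched by writing both approaches side by side. The method names and sample words below are made up for illustration, and no timings are claimed. The sketch only shows that the two give the same counts while the removal step does a different amount of work:

```ruby
# Approach described above: count everything, then delete one hash key per
# stop word. The removal step does one lookup per stop word.
def count_then_delete(words, stop_words)
  counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  stop_words.each { |sw| counts.delete(sw) }
  counts
end

# Alternative: filter the corpus first. Every word in the corpus is checked
# against the stop list, so the removal step grows with the corpus size.
def filter_then_count(words, stop_words)
  words
    .reject { |w| stop_words.include?(w) }
    .each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
end

# Hypothetical sample data.
sample_words = %w[the fox and the hound of the hills]
stop_list = %w[a and of the]

p count_then_delete(sample_words, stop_list) == filter_then_count(sample_words, stop_list)
# => true
```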