Words like "a", "the", "best", "kind". I'm pretty sure there are good ways to achieve this.
To be clear, I am looking for
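Answer 0 (score: 2)
These common words are called "stop words". There is a similar Stack Overflow question here: "Stop words" list for English?
Summary:
If you just put these words into a hash in your program, it should be easy to filter any list of words.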
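A minimal sketch of that idea, assuming a short made-up stop list (a real list would be much longer) and a hypothetical helper named `remove_stop_words`. It uses Ruby's `Set` for the membership test; a plain Hash with the words as keys would work the same way.

```ruby
require 'set'

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = Set.new(%w[a an and or the to is in be])

# Returns the words of +text+ that are not in the stop-word set.
# Each word is downcased before the lookup, so matching is case-insensitive.
def remove_stop_words(text, stop_words = STOP_WORDS)
  text.scan(/[\w']+/).reject { |word| stop_words.include?(word.downcase) }
end

puts remove_stop_words("To be, or not to be: that is the question").join(' ')
# prints "not that question"
```

Each lookup is a hash probe, so the cost of filtering grows with the length of the text rather than with the size of the stop list.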
Answer 1 (score: 1)
Common = %w{ a and or to the is in be }
Uncommon = %{
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
}.split /\b/
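# Build a lookup table of the common (stop) words.
ignore_me, result = {}, []
Common.each { |w| ignore_me[w.downcase] = :Common }
# Keep each token unless its leading word characters form a stop word.
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join
Output: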
, not : that question:
Whether 'tis nobler mind suffer
slings arrows of outrageous fortune,
take arms against sea of troubles,
by opposing end them? die: sleep;
No more; by sleep say we end
heart-ache thousand natural shocks
That flesh heir , 'tis consummation
Devoutly wish'd. die, sleep;
sleep: perchance dream: ay, there's rub;
For that sleep of death what dreams may come
Answer 2 (score: 1)
This is a variation on DigitalRoss's answer.
str=<<EOF
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
EOF
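# Lookup table of stop words.
common = {}
%w{ a and or to the is in be }.each { |w| common[w] = true }
# Blank out each stop word, then collapse the runs of spaces left behind.
puts str.gsub(/\b\w+\b/) { |word| common[word.downcase] ? '' : word }.squeeze(' ')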
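Also related: What's the fastest way to check if a word from one string is in another string?
Answer 3 (score: 0)
Hold on, you need to do some research before you take out stop words (also known as noise words or junk words). Index size and processing resources are not the only issues. A lot depends on whether end users will be typing the queries or whether you will be running long automated queries.
All search log analysis shows that people tend to type one to three words per query. When that is all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents, which makes it very common, but if the word is not in the index, you cannot do exact phrase searches or proximity relevance ranking. There are also perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".
So while there are technical issues to consider, and taking out stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.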
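To make that concern concrete, here is a small sketch. The stop list and the `strip_stop_words` helper are hypothetical, invented for illustration; the point is only that a short query can lose most or all of its terms once stop words are dropped.

```ruby
require 'set'

# Hypothetical stop list that happens to include "who".
STOP_LIST = Set.new(%w[a an the of who])

# Drops every query term found in the stop list (case-insensitive).
def strip_stop_words(query, stop_list = STOP_LIST)
  query.split.reject { |term| stop_list.include?(term.downcase) }
end

p strip_stop_words("The Who")            # nothing is left to search for
p strip_stop_words("history of The Who") # only "history" survives
```

With a one-to-three-word query, a result like the first one leaves the search engine with no terms at all.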
Answer 4 (score: 0)
If you have an array of words to remove named stop_words, then you get your result from this expression:
description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '
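If you want to preserve the non-word characters between the words:
# \W* rather than \W+, so the last word still matches when nothing follows it
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join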
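As a usage sketch, the two expressions can be wrapped in methods and run on sample data. The method names, the sample sentence, and the stop list below are all made up for illustration, and the words are downcased before the lookup, which is an added assumption so that "The" matches "the".

```ruby
# Removes stop words and rejoins the remaining words with single spaces.
def drop_words(description, stop_words)
  description.scan(/\w+/).reject { |word| stop_words.include?(word.downcase) }.join(' ')
end

# Same idea, but keeps the separator that follows each surviving word.
# \W* lets the final word match even when nothing comes after it.
def drop_words_keep_punctuation(description, stop_words)
  description.scan(/(\w+)(\W*)/)
             .reject { |word, _sep| stop_words.include?(word.downcase) }
             .flatten.join
end

stop_words = %w[a the best kind]
text = "The best kind of sleep, perchance a dream"

puts drop_words(text, stop_words)                  # of sleep perchance dream
puts drop_words_keep_punctuation(text, stop_words) # of sleep, perchance dream
```

Because `stop_words` is an Array here, each `include?` call scans the whole list; for a long stop list, a Hash or Set lookup would likely be the better fit.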