Question

Words like "a", "the", "best", "kind". I'm pretty sure there are good ways to accomplish this.

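To be clear, I am looking for:

The simplest solution that can be implemented, preferably in Ruby.
I have a high tolerance for errors.
If I need a library of common phrases, I'm perfectly happy with that too.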

Answer 1

These common words are known as "stop words" - there is a similar Stack Overflow question here: "Stop words" list for English?

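To summarize:

If you need to process a large amount of text, it makes sense to gather statistics on word frequency in that particular data set and use the most frequent words as your stop word list. (The fact that you include "kind" in your examples suggests you may have a rather unusual data set, for example one with lots of colloquial expressions like "kind of", so perhaps you do need to do this.)
Since you say you are not too concerned about errors, it may be enough to use an English stop word list that someone else has produced, e.g. the fairly long one used by MySQL or anything else that Google turns up.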
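
As a rough sketch of the first option, word frequencies can be counted over your own corpus and the top entries taken as the stop word list. The helper name `frequent_words`, the sample corpus, and the cutoff of 3 are illustrative assumptions, not part of the original answer.

```ruby
# Hypothetical helper: returns the `limit` most frequent words in `text`.
def frequent_words(text, limit)
  counts = Hash.new(0)
  # Lowercase first so "The" and "the" are counted together.
  text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  # Sort by descending count, breaking ties alphabetically for stable output.
  counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
end

corpus = "the cat sat on the mat and the dog sat on the rug"
stop_words = frequent_words(corpus, 3)
puts stop_words.inspect
```

On a real data set the cutoff would need tuning by eye, since the most frequent words can include terms that matter for your domain.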

If you just put these words into a hash in your program, it should be easy to filter any list of words.
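
A minimal sketch of that idea, using a `Set` so each lookup is a hash-style membership test rather than a scan of an array. The short `STOP_WORDS` list and the helper name `remove_stop_words` are placeholders; a real list would be much longer.

```ruby
require 'set'

# Placeholder list; substitute a full English stop word list in practice.
STOP_WORDS = Set.new(%w[a an and be in is of or the to])

# Hypothetical helper: drops any word whose lowercase form is a stop word.
def remove_stop_words(words)
  words.reject { |word| STOP_WORDS.include?(word.downcase) }
end

words = %w[The slings and arrows of outrageous fortune]
puts remove_stop_words(words).join(' ')
```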

Answer 2

  Common = %w{ a and or to the is in be }
Uncommon = %{
  To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
}.split /\b/
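ignore_me, result = {}, []
# Index the common words in a hash for fast lookup.
  Common.each { |w| ignore_me[w.downcase] = :Common          }
# Keep every token (word or separator) whose leading word is not a common word.
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join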

 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come

Answer 3

This is a variation on DigitalRoss's answer.

str=<<EOF
To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')

Also related: What's the fastest way to check if a word from one string is in another string?

Answer 4

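Hold on, you need to do some research before you take out stop words (also known as noise words or junk words). Index size and processing resources are not the only issues. A lot depends on whether end users will be typing queries, or whether you will be working with long automated queries.

All search log analysis shows that people tend to type one to three words per query. When that is all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents - making it very common - but if the word is not in the index, it is impossible to do exact phrase searches or proximity relevance ranking. In addition, there are perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".

So while there are technical issues to consider, and taking out stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.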
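
To make the risk concrete, here is a small illustration of what a naive filter does to short queries. The stop word list and the `strip_stop_words` helper are assumptions made up for this example, not taken from any particular search engine.

```ruby
require 'set'

# Illustrative stop word list only.
QUERY_STOP_WORDS = Set.new(%w[a an the who what is of])

# Hypothetical helper: removes stop words from a user's search query.
def strip_stop_words(query)
  query.split.reject { |w| QUERY_STOP_WORDS.include?(w.downcase) }.join(' ')
end

['The Who', 'The The', 'copyright law'].each do |query|
  puts "#{query.inspect} -> #{strip_stop_words(query).inspect}"
end
```

Both band names reduce to an empty string, leaving the search nothing to match, while the third query passes through unchanged.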

Answer 5

If you have an array of words to remove named stop_words, this expression gives you the result:

description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

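If you want to keep the non-word characters between words, use:

# \W* rather than \W+, so the final word is kept even with no trailing punctuation
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join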
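
For reference, a self-contained run of the first expression with sample data. The values of `description` and `stop_words` are invented here, and the `downcase` call is an addition: a plain `include?` is case-sensitive and would let "The" through.

```ruby
# Sample inputs; both values are made up for illustration.
description = 'The best kind of coffee is the fresh kind'
stop_words  = %w[a the best kind of is]

# Compare lowercase forms so capitalized words are filtered too.
filtered = description.scan(/\w+/).reject { |word|
  stop_words.include?(word.downcase)
}.join(' ')

puts filtered
```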

Simple filtering out of common words from a text description

5 answers: