Simple filtering of common words out of a text description

Time: 2011-01-11 07:28:14

Tags: ruby text full-text-search taxonomy stop-words

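Words like "a", "the", "best", "kind". I'm pretty sure there are good ways of achieving this.

For clarity, I am looking for

  1. The simplest solution to implement, preferably in Ruby.
  2. I have a high tolerance for errors.
  3. If I need a library of common phrases, I'm perfectly happy with that too.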

5 answers:

Answer 0: (score: 2)

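These common words are known as "stop words". There is a similar Stack Overflow question about them: "Stop words" list for English?

In summary:

  • If you have a large amount of text to process, it is worth gathering statistics on word frequency in that particular data set and using the most frequent words as your stop word list. (That you include "kind" in your examples suggests you may have a rather unusual data set, e.g. one with lots of colloquial expressions like "kind of", so perhaps you would need to do this.)
  • Since you say you don't care much about errors, it may be enough to use an English stop word list that someone else has produced, e.g. the fairly long one used by MySQL, or anything else that Google turns up.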
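A minimal sketch of the frequency approach from the first bullet, assuming the corpus fits in memory as an array of strings. The helper name `frequent_words`, the cutoff `top_n` and the sample corpus are all made up for illustration.

```ruby
# Hypothetical helper: derive a stop-word list from the corpus itself by
# taking the `top_n` most frequent words across all texts.
def frequent_words(texts, top_n)
  counts = Hash.new(0)
  texts.each do |text|
    # Lowercase first so "The" and "the" are counted together.
    text.downcase.scan(/[a-z']+/) { |word| counts[word] += 1 }
  end
  # Sort by descending count, breaking ties alphabetically so the result is stable.
  counts.sort_by { |word, count| [-count, word] }.first(top_n).map(&:first)
end

# Made-up sample data, only here to make the sketch runnable.
corpus = [
  "The best kind of tea is the kind you like",
  "A kind word is the best gift",
  "The tea is cold"
]
stop_words = frequent_words(corpus, 3)
puts stop_words.inspect
```

How many of the top words to treat as stop words is a judgment call that depends on the data set; the cutoff of 3 here is arbitrary.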

If you just put those words into a hash in your program, it should be easy to filter any list of words.
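Roughly, the hash idea could look like the sketch below. The word list is a tiny illustrative sample rather than a real stop word list, and `STOP_WORDS` and `remove_stop_words` are names chosen for this example.

```ruby
# Stop words stored as hash keys, so each membership check is a hash lookup.
# The list here is a tiny illustrative sample, not a real stop-word list.
STOP_WORDS = %w[a an the best kind of is].each_with_object({}) do |word, lookup|
  lookup[word] = true
end

# Hypothetical helper: drop every word whose lowercase form is a stop word.
def remove_stop_words(words)
  words.reject { |word| STOP_WORDS[word.downcase] }
end

words = %w[The best kind of Ruby is a simple Ruby]
puts remove_stop_words(words).join(' ')
```

Downcasing before the lookup is a choice made here so that "The" and "the" are treated alike; drop it if case matters for your data.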

Answer 1: (score: 1)

  Common = %w{ a and or to the is in be }
Uncommon = %{
  To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
}.split(/\b/)  # split at word boundaries; punctuation and whitespace stay as their own tokens
ignore_me, result = {}, []
  Common.each { |w| ignore_me[w.downcase] = :Common          }
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join


 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come

Answer 2: (score: 1)

This is a variation on DigitalRoss's answer.

str=<<EOF
To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')

Also related: What's the fastest way to check if a word from one string is in another string?

Answer 3: (score: 0)

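Hold on, you need to do some research before you take out stop words (also known as noise words or junk words). Index size and processing resources aren't the only issues. A lot depends on whether end users will be typing the queries, or whether you will be working with long automated queries.

All search log analysis shows that people tend to type one to three words per query. When that is all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents, which makes it very common, but if the word is not in the index, it is impossible to do exact phrase searches or proximity-based relevance ranking. There are also perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".

So while there are technical issues to consider, and taking out stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.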
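To make the concern concrete, here is a small sketch of what happens to short queries when stop words are stripped from them. The list is an illustrative sample that assumes "who" is treated as a stop word, and `QUERY_STOP_WORDS` and `strip_stop_words` are hypothetical names.

```ruby
require 'set'

# Illustrative sample only; a real list would be much longer.
QUERY_STOP_WORDS = Set.new(%w[a an the who of to])

# Hypothetical helper: strip stop words from a user's search query.
def strip_stop_words(query)
  query.split.reject { |word| QUERY_STOP_WORDS.include?(word.downcase) }
end

# Print what is left of each query after filtering.
["ruby stop words", "The Who", "The The"].each do |query|
  kept = strip_stop_words(query)
  puts "#{query.inspect} -> #{kept.inspect}"
end
```

With this sample list the two band names are reduced to nothing, so a search built on the filtered query would have no terms left to match.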

Answer 4: (score: 0)

If you have an array of the words to remove, named stop_words, then this expression gives you the result:

description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

If you want to keep the non-word characters between the words:

# \W* rather than \W+, so a final word with nothing after it is still matched
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join
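For reference, a runnable sketch of both expressions with sample inputs. The `description` and `stop_words` values are assumed for illustration, and two things differ from the expressions as written: the comparison is made case-insensitive with `downcase`, and a `Set` is used so that lookups stay cheap with a long list.

```ruby
require 'set'

# Sample inputs, assumed for illustration; neither appears in the question.
description = "The best kind of editor is a fast editor."
stop_words  = Set.new(%w[a the best kind of is])

# Words only: punctuation and original spacing are discarded.
words_only = description.scan(/\w+/).reject do |word|
  stop_words.include?(word.downcase)
end.join(' ')

# Keep the non-word characters that follow each surviving word.
# \W* (rather than \W+) lets a final word with nothing after it still match.
with_punctuation = description.scan(/(\w+)(\W*)/).reject do |(word, _separator)|
  stop_words.include?(word.downcase)
end.flatten.join

puts words_only
puts with_punctuation
```

Note that the second form drops each removed word together with the separator that follows it, and anything before the first word is not captured at all.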