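After using a keyword API to get popular keywords and phrases, I also get back a lot of "dirty" terms padded with extra words ("the", "a", etc.).
I'd also like to isolate names in the search terms.
Is there a Ruby library for cleaning up keyword lists? Does such an algorithm even exist?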
Answer 0 (score: 5)
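You're talking about "stop words": words such as "the" and "a", plus other frequently encountered words that carry no value.
Stopword lists exist; WordNet has one, and if I remember right there may be one in Lingua, in the Ruby WordNet bindings, or in a readability module, but they're really easy to generate yourself. And you will probably need to, because junk words vary by subject matter.
The simplest approach is to run a preliminary pass over several sample documents: split the text into words, loop over them, and increment a counter for each word. When you're done, look for words that are two to four letters long and have disproportionately high counts. Those are good stopword candidates.
Then run a pass over your target documents, splitting the text as before and counting occurrences as you go. You can either skip words that are in your stopword list and never add them to the hash, or process everything and delete the stopwords afterward.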
text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.
These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT
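# Do this against several documents to build a stopword list. Tweak as necessary to fine-tune the words.
# Note: this simple filter keeps every word shorter than five characters; the counts are collected but not used here.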
stopwords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }.select{ |n,v| n.length < 5 }
print "Stopwords => ", stopwords.keys.sort.join(', '), "\n"
# >> Stopwords => 2606, 3, and, are, by, com, edu, for, have, in, into, net, not, or, org, page, rfc, see, this, use, web, you, your
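The example above filters on word length only. A minimal sketch of the "disproportionately high count" idea, run over several sample documents, might look like the following. The method name `stopword_candidates`, the `factor` threshold (a multiple of the average count), and the sample sentences are all assumptions made for illustration, not part of the original answer.

```ruby
# Hypothetical helper: returns short words whose total count across the
# sample documents is well above the average count per distinct word.
# factor and length_range are illustrative defaults; tune them for your corpus.
def stopword_candidates(documents, factor: 1.5, length_range: 2..4)
  counts = Hash.new(0)
  documents.each do |doc|
    doc.downcase.split(/\W+/).each { |w| counts[w] += 1 unless w.empty? }
  end
  return [] if counts.empty?

  mean = counts.values.sum.to_f / counts.size
  counts.select { |w, n| length_range.cover?(w.length) && n > factor * mean }.keys.sort
end

# Made-up sample documents, for illustration only.
samples = [
  "The cat sat on the mat and looked at the door.",
  "A dog and a cat played in the garden.",
  "The garden is full of roses and the sun is out."
]
puts stopword_candidates(samples).join(', ')
# >> and, the
```

The result is only a list of candidates; as the answer says, it still needs tweaking by hand for the subject matter.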
Then you're ready to do your keyword gathering:
text = <<EOT
You have reached this web page by typing "example.com", "example.net","example.org"
or "example.edu" into your web browser.
These domain names are reserved for use in documentation and are not available
for registration. See RFC 2606, Section 3.
EOT
stopwords = %w[2606 3 and are by com edu for have in into net not or org page rfc see this use web you your]
keywords = text.downcase.split(/\W+/).inject(Hash.new(0)) { |h,w| h[w] += 1; h }
stopwords.each { |s| keywords.delete(s) }
# output in order of most often seen to least often seen.
keywords.keys.sort{ |a,b| keywords[b] <=> keywords[a] }.each { |k| puts "#{k} => #{keywords[k]}"}
# >> example => 4
# >> names => 1
# >> reached => 1
# >> browser => 1
# >> these => 1
# >> domain => 1
# >> typing => 1
# >> reserved => 1
# >> documentation => 1
# >> available => 1
# >> registration => 1
# >> section => 1
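The other option mentioned earlier, skipping stopwords while counting instead of deleting them afterward, could be sketched roughly as follows. The method name `count_keywords` and the shortened sample text are assumptions for illustration; a `Set` is used so that each lookup stays cheap when the stopword list grows.

```ruby
require 'set'

# Hypothetical helper: counts words in text, never adding stopwords to the hash.
def count_keywords(text, stopwords)
  stop = stopwords.to_set
  text.downcase.split(/\W+/).each_with_object(Hash.new(0)) do |w, h|
    next if w.empty? || stop.include?(w)
    h[w] += 1
  end
end

text = 'These domain names are reserved for use in documentation and are not available for registration.'
stopwords = %w[and are for in not use]
counts = count_keywords(text, stopwords)

# Sort by descending count, then alphabetically, so ties print in a stable order.
counts.sort_by { |word, n| [-n, word] }.each { |word, n| puts "#{word} => #{n}" }
# >> available => 1
# >> documentation => 1
# >> domain => 1
# >> names => 1
# >> registration => 1
# >> reserved => 1
# >> these => 1
```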
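After you've narrowed down your list of words, you can run the candidates through WordNet to find synonyms, homonyms, and word relations, strip plurals, and so on. If you're doing this over a lot of text, you'll want to keep your stopwords in a database where you can continually fine-tune them. The same applies to your keywords, because from those you can start to determine tone and other semantic goodness.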
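As a rough illustration of keeping the stopword list somewhere persistent so it can be tuned over time, here is a minimal sketch in which a one-word-per-line text file stands in for the database. The helper names `load_stopwords` and `save_stopwords` and the file location are assumptions; a real setup would likely swap in whatever database you already use.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helpers: a plain text file stands in for the database.
def load_stopwords(path)
  return Set.new unless File.exist?(path)
  File.readlines(path, chomp: true).reject(&:empty?).to_set
end

def save_stopwords(path, words)
  File.write(path, words.to_a.sort.join("\n") + "\n")
end

path = File.join(Dir.tmpdir, 'stopwords_example.txt') # illustrative location
words = load_stopwords(path)
words.merge(%w[and are for the]) # add newly identified junk words
words.delete('are')              # or drop one that turned out to be useful
save_stopwords(path, words)
puts load_stopwords(path).to_a.sort.join(', ')
```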
Answer 1 (score: 0)
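# str is assumed to already hold the input text.
bad_words = ["the", "a", "for", "on"] # etc.
# Strip non-alphanumeric characters, split into a temporary array, then remove the bad words.
# Note: the comparison is case-sensitive, so "The" is kept unless str is downcased first.
tmp_str = str.gsub(/[^A-Za-z0-9\s]/, "").split - bad_words
str = tmp_str.join(" ")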
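A variation on the snippet above, wrapped in a method so it can be run on its own, might look like this. It is a sketch with two deliberate differences that are not in the original answer: the text is downcased so "The" matches "the", and punctuation is replaced with a space rather than removed, so "fox,on" does not collapse into one word. The method name `strip_bad_words` is hypothetical.

```ruby
# Hypothetical helper: case-insensitive removal of unwanted words.
def strip_bad_words(str, bad_words)
  # Replace anything that is not a letter, digit, or whitespace with a space.
  words = str.downcase.gsub(/[^a-z0-9\s]/, ' ').split
  (words - bad_words).join(' ')
end

puts strip_bad_words('The quick fox, on a log.', %w[the a for on])
# >> quick fox log
```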