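Ruby meaningful words - removing stop words

Date: 2015-03-02 21:42:34

Tags: ruby nlp stop-words

I have a collection of speech files that I need to compare against a list of stop words, so that I can remove the stop words and keep the remaining meaningful words.

So far I have something like this: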

stopwords = File.readlines('PATH TO TXT FILE')
speeches = []

Dir.glob('PATH TO ALL SPEECHES').each do |speech|
    #code to read each speech and store into an array
    f = File.readlines(speech)
    speeches << f
end

lincolnSpeech = speeches[0]

def process_file(file_name)
    all_words = file_name.scan(/\w+/)
    meaningful_words = all_words.select { |word| !stopwords.include?(word) }
    return meaningful_words
end

I embed the result of this function in my HTML like this:

<ul>
      <li><pre style="white-space: pre-wrap;word-wrap: break-word">#{process_file(lincolnSpeech)}</pre></li>
</ul>

But this breaks the page and makes my HTML disappear entirely. I have narrowed the problem down to this line in the function:

meaningful_words = all_words.select { |word| !stopwords.include?(word) }

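This line is the culprit. I'm not sure why it breaks my code. Has part of it perhaps been deprecated? Can anyone suggest why this isn't working, and perhaps another way to achieve what I want?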

1 Answer:

Answer 0: (score: 0)

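I'm really surprised you aren't getting a NoMethodError on the scan call. File.readlines returns an array of strings, so the lincolnSpeech you pass to process_file is an array, and I don't think arrays have a scan method.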
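That diagnosis can be checked in isolation. The following is a minimal sketch, not the asker's actual data: `lines` is an illustrative name, and its contents are made up to stand in for what File.readlines would return for a speech file.

```ruby
# Stand-in for the result of File.readlines: one string per line,
# each still carrying its trailing newline. The text is made up.
lines = ["Four score and seven years ago\n", "our fathers brought forth\n"]

# Array does not define scan, so calling lines.scan(/\w+/) would raise
# NoMethodError. respond_to? lets us check that without raising.
array_has_scan  = lines.respond_to?(:scan)       # false
string_has_scan = lines.join.respond_to?(:scan)  # true

# Joining the lines into a single String makes scan available.
words = lines.join.scan(/\w+/)
puts words.inspect
```

Reading the file with File.read instead of File.readlines avoids the join, since File.read returns the whole file as one string.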

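Assuming your stop words are one per line in that file, I would do something like this:

require 'set'

# Finding an item in a Set will be faster than finding one in
# an array, especially if the array is large.
# chomp strips the trailing newline that readlines leaves on each line.
stopwords = Set.new(File.readlines('PATH TO TXT FILE').map(&:chomp))
speech_files = Dir.glob('PATH TO ALL SPEECHES')

lincoln_speech = speech_files[0]

# stopwords is passed in as an argument because a method defined with
# def cannot see local variables from the enclosing scope.
def process_file(file_name, stopwords)
    speech_words = File.read(file_name).split # get each word in the file
    # reject stop words and glue the rest back together
    speech_words.reject { |word| stopwords.include?(word) }.join(' ')
end

There are two big problems that might exist with this. One is that the argument-less split call is a bit naive and will include punctuation in the words it splits, like so:

"Well, space is there, and we're going to climb it,".split
# => ["Well,", "space", "is", "there,", "and", "we're", "going", "to", "climb", "it,"]

Notice the commas attached to the words. Using split(/\W+/) is a partial solution, but it will split "we're" into ["we", "re"].

The other problem is that split assumes every word is separated by whitespace. An argument-less split treats newlines as whitespace too, so line breaks are handled, but words joined by anything else (a dash with no spaces around it, for example) stay stuck together. For simple input, what I posted should work fine, but if you are dealing with fairly complex speeches, you may need to clean up your input a bit to make this work.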
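One way to do that cleanup is to pull words out with scan instead of splitting, and to normalise both sides before comparing. This is only a sketch under stated assumptions: `WORD_PATTERN`, `load_stopwords` and `meaningful_words` are hypothetical names introduced here, the inline strings stand in for the real stop-word and speech files, and the pattern is a rough one that keeps apostrophes inside words but would also keep stray quote marks.

```ruby
require 'set'

# Hypothetical pattern: runs of word characters and apostrophes, so that
# "we're" stays whole while commas, periods and dashes are dropped.
WORD_PATTERN = /[\w']+/

# Hypothetical helper: builds a Set of stop words from the lines of a
# stop-word file, removing newlines and lowercasing each entry.
def load_stopwords(lines)
  Set.new(lines.map { |line| line.strip.downcase }.reject(&:empty?))
end

# Hypothetical helper: returns the words in text that are not stop words.
# stopwords is a parameter because a def body cannot see outer locals.
def meaningful_words(text, stopwords)
  text.scan(WORD_PATTERN).reject { |word| stopwords.include?(word.downcase) }
end

# Stand-ins for File.readlines(stopword_path) and File.read(speech_path).
stopword_lines = ["is\n", "there\n", "and\n", "we're\n", "to\n", "it\n"]
speech = "Well, space is there,\nand we're going to climb it."

result = meaningful_words(speech, load_stopwords(stopword_lines))
puts result.join(' ') # prints "Well space going climb"
```

Lowercasing only for the comparison keeps the original capitalisation in the output while still matching a stop word at the start of a sentence.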