Question

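I have a set of speech files that I need to compare against a list of stopwords, so that the stopwords are removed and only the meaningful words are left.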

So far I have something like this:

stopwords = File.readlines('PATH TO TXT FILE')
speeches = []

Dir.glob('PATH TO ALL SPEECHES').each do |speech|
    #code to read each speech and store into an array
    f = File.readlines(speech)
    speeches << f
end

lincolnSpeech = speeches[0]

def process_file(file_name)
    all_words = file_name.scan(/\w+/)
    meaningful_words = all_words.select { |word| !stopwords.include?(word) }
    return meaningful_words
end

I'm embedding the result of this function in my HTML like so:

<ul>
      <li><pre style="white-space: pre-wrap;word-wrap: break-word">#{process_file(lincolnSpeech)}</pre></li>
</ul>

But this breaks the page and makes my HTML disappear completely. I've narrowed the problem down to this line in the function:

meaningful_words = all_words.select { |word| !stopwords.include?(word) }

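This line is the culprit. I'm not sure why it breaks my code. Maybe part of it is deprecated? Can anyone offer some thoughts on why this doesn't work, and perhaps another way to achieve what I'm after?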

Answer 1

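I'm really surprised you aren't getting a NoMethodError on the scan call. File.readlines returns an array of strings, so the lincolnSpeech that gets passed to process_file is an array, and I don't think arrays have a scan method.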
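A quick way to check this yourself, as a minimal sketch: the temp file and its contents below are made up for illustration and just stand in for one of the speeches.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for one of the speech files.
sample = Tempfile.new('speech')
sample.write("Four score and seven years ago\nour fathers brought forth\n")
sample.close

lines = File.readlines(sample.path) # Array of Strings, one per line
text  = File.read(sample.path)      # the whole file as a single String

lines_can_scan = lines.respond_to?(:scan) # Array has no scan method
text_can_scan  = text.respond_to?(:scan)  # String does

words = text.scan(/\w+/) # works because text is a String
```

So reading the file with File.read (or joining the lines first) gives scan something it can actually work on.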

Assuming your stopwords are one per line in that file, I'd do something like this:

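require 'set'

# Finding an item in a Set will be faster than finding one in
# an array, especially if the array is large.
# map(&:chomp) strips the trailing newline that readlines keeps on each line.
stopwords = Set.new(File.readlines('PATH TO TXT FILE').map(&:chomp))
speech_files = Dir.glob('PATH TO ALL SPEECHES')

lincoln_speech = speech_files[0]

# stopwords is passed in explicitly, because a top-level local
# variable is not visible inside a method body.
def process_file(file_name, stopwords)
  speech_words = File.read(file_name).split # get each word in the file
  # reject stopwords and glue the rest back together
  speech_words.reject { |word| stopwords.include?(word) }.join(' ')
end

There are two potentially big problems with this. One is that the no-argument split call is a bit naive, and will include punctuation in the words it splits off, like so:

"Well, space is there, and we're going to climb it,".split
# => ["Well,", "space", "is", "there,", "and", "we're", "going", "to", "climb", "it,"]

Note the commas attached to the words. Using split(/\W+/) is a partial solution, but it will split "we're" into ["we", "re"].

The other potential problem is that this assumes every word is separated from its neighbours by whitespace. A no-argument split does treat newlines the same as spaces, so line breaks are fine, but words joined by dashes or other punctuation will stay stuck together. For simple input, what I've posted should work well enough, but if you're processing fairly complex speeches, you may need to clean up your input a bit first.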
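One possible way around the punctuation and apostrophe cases, as a rough sketch rather than a definitive fix: scan for runs of letters that may contain internal apostrophes, and lowercase everything first. The meaningful_words name, the tiny stopword list, and the ASCII-only pattern are all assumptions made for illustration; real speeches may need a more careful pattern.

```ruby
require 'set'

# Hypothetical stopword list; in practice this would be loaded from the
# stopwords file, e.g. File.readlines(path).map(&:chomp).
STOPWORDS = Set.new(%w[is there and we're to it])

# Hypothetical helper: returns the words of text that are not stopwords.
def meaningful_words(text, stopwords)
  # Match runs of ASCII letters, allowing internal apostrophes ("we're").
  # Punctuation and newlines between words are simply skipped over.
  words = text.downcase.scan(/[a-z]+(?:'[a-z]+)*/)
  words.reject { |word| stopwords.include?(word) }
end

sentence = "Well, space is there,\nand we're going to climb it,"
result = meaningful_words(sentence, STOPWORDS)
# result is ["well", "space", "going", "climb"]
```

Lowercasing only helps if the stopword file is lowercase too, so it's worth checking that before relying on this.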
