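I have a series of speech files that I need to compare against a list of stopwords, so that I can remove the stopwords and keep only the meaningful words that remain.
So far, I have something like this: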
stopwords = File.readlines('PATH TO TXT FILE')
speeches = []
Dir.glob('PATH TO ALL SPEECHES').each do |speech|
#code to read each speech and store into an array
f = File.readlines(speech)
speeches << f
end
lincolnSpeech = speeches[0]
def process_file(file_name)
all_words = file_name.scan(/\w+/)
meaningful_words = all_words.select { |word| !stopwords.include?(word) }
return meaningful_words
end
I embed the result of this function in my HTML like so:
<ul>
<li><pre style="white-space: pre-wrap;word-wrap: break-word">#{process_file(lincolnSpeech)}</pre></li>
</ul>
But this breaks the page and makes my HTML disappear entirely. I have narrowed the problem down to this line in the function:
meaningful_words = all_words.select { |word| !stopwords.include?(word) }
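This line is the culprit. I'm not sure why it breaks my code. Maybe part of it is deprecated? Can anyone offer some ideas on why this doesn't work, and perhaps another way to achieve what I want?
Answer 0 (score: 0)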
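I'm really surprised you aren't getting a NoMethodError on the scan call. File.readlines returns an array of strings, so the lincolnSpeech that is passed to process_file is an array, and I don't think arrays have a scan method.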
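To make the difference concrete, here is a minimal sketch that compares the two ways of reading a file. The temporary file and its contents are made up for illustration and stand in for one of the speeches.

```ruby
require 'tempfile'

# Hypothetical sample file standing in for one of the speech files.
sample = Tempfile.new('speech')
sample.write("Four score and seven years ago\nour fathers brought forth\n")
sample.close

lines = File.readlines(sample.path) # an Array of Strings, one per line
text  = File.read(sample.path)      # the whole file as a single String

puts lines.class               # Array
puts lines.respond_to?(:scan)  # false, so lines.scan would raise NoMethodError
puts text.class                # String
p text.scan(/\w+/).first(3)    # ["Four", "score", "and"]
```

If the per-line array is needed elsewhere, joining it back into one string before scanning should also work.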
Assuming your stopwords are one per line in that file, I would do something like this:
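require 'set'

# Finding an item in a Set will be faster than finding one in
# an array, especially if the array is large.
# chomp: true drops the trailing newline that readlines keeps on each line.
STOPWORDS = Set.new(File.readlines('PATH TO TXT FILE', chomp: true))
speech_files = Dir.glob('PATH TO ALL SPEECHES')
lincoln_speech = speech_files[0]

# STOPWORDS is a constant because a def body cannot see top-level local variables.
def process_file(file_name)
  speech_words = File.read(file_name).split # get each word in file
  speech_words.reject { |word| STOPWORDS.include?(word) }.join(' ') # reject stopwords and glue it back together
end

There are two potential problems, both caused by the argument-less split call being a bit naive. One is that it will include punctuation in the words it splits out, like so:

"Well, space is there, and we're going to climb it,".split
# => ["Well,", "space", "is", "there,", "and", "we're", "going", "to", "climb", "it,"]

Note the commas attached to the words. Using split(/\W+/) is a partial solution, but it will split "we're" into ["we", "re"].

The other problem is that split only breaks on whitespace. Spaces, tabs, and newlines are all handled, but words joined by anything else, such as a dash or a slash, stay glued together. For simple input, what I posted should work fine, but if you are processing fairly complex speeches, you may need to clean up your input a bit before this will work.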
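One possible way to clean up the input is to extract words with a regular expression instead of splitting on whitespace. The sketch below is only an illustration under a few assumptions: the method name meaningful_words and the tiny stopword list are hypothetical, the text is lowercased so that "Well" and "well" compare equal, and the pattern only handles ASCII letters and digits. The stopwords are passed in as an argument because a def body cannot see local variables defined at the top level.

```ruby
require 'set'

# Return the words of text that are not in stopwords.
# Hypothetical helper, not part of the original code.
def meaningful_words(text, stopwords)
  # Runs of letters/digits, optionally joined by apostrophes, so "we're"
  # stays one word while trailing commas and periods are dropped.
  words = text.downcase.scan(/[a-z0-9]+(?:'[a-z0-9]+)*/)
  words.reject { |word| stopwords.include?(word) }
end

# Made-up stopword list; in practice it would be loaded from the stopword file.
stopwords = Set.new(%w[is and to it])

sentence = "Well, space is there, and we're going to climb it,"
p meaningful_words(sentence, stopwords)
# => ["well", "space", "there", "we're", "going", "climb"]
```

The stopword file would need to be lowercased the same way for the comparison to line up.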