Question

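I have millions of arrays, each containing about five strings. I am trying to remove all of the "junk words" (for lack of a better description) from the arrays, such as articles and words like "to", "and", "or", "the", "a", and so on.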

For example, one of my arrays has these six strings:

"14000"
"Things"
"to"
"Be"
"Happy"
"About"

I want to remove "to" from the array.

One solution is:

excess_words = ["to","and","or","the","a"]
cleaned_array = dirty_array.reject {|term| excess_words.include? term}

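But I would like to avoid typing out every excess word by hand. Does anyone know of a Rails function or helper that would help with this? Or perhaps an array of "junk words" that has already been written?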

Answer 1

Dealing with stopwords is easy, but I'd suggest you do it before you split the string into its component words.

Building a fairly simple regular expression makes short work of the words:

STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i

clean_string = 'to into and sandbar or forest the thesis a algebra'.gsub(STOPWORDS, '')
# => " into  sandbar  forest  thesis  algebra"

clean_string.split
# => ["into", "sandbar", "forest", "thesis", "algebra"]

What if you've already split them? I'd join(' ') the array to turn it back into a string, run the code above, and split the result to get an array again.

incoming_array = [
  "14000",
  "Things",
  "to",
  "Be",
  "Happy",
  "About",
]

STOPWORDS = /\b(?:#{ %w[to and or the a].join('|') })\b/i
# => /\b(?:to|and|or|the|a)\b/i

incoming_array = incoming_array.join(' ').gsub(STOPWORDS, '').split
# => ["14000", "Things", "Be", "Happy", "About"]
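One caveat of the join/split round trip is that any element that itself contains a space gets split into several elements. If that matters for your data, a possible alternative is to test each element against an anchored pattern instead. This is only a sketch: the names STOPWORD_LIST, WHOLE_STOPWORD and remove_stopwords are made up for illustration, and the word list is the same small sample used above.

```ruby
# Hypothetical sample list; replace with your own stopwords.
STOPWORD_LIST = %w[to and or the a].freeze

# Anchored with \A and \z so only elements that are entirely a stopword match.
# Regexp.union escapes each word, so special characters in the list are safe.
WHOLE_STOPWORD = /\A(?:#{Regexp.union(STOPWORD_LIST).source})\z/i

# Returns a new array without the elements that are stopwords (case-insensitive).
def remove_stopwords(terms)
  terms.reject { |term| term.match?(WHOLE_STOPWORD) }
end

remove_stopwords(["14000", "Things", "to", "Be", "Happy", "About"])
# => ["14000", "Things", "Be", "Happy", "About"]

remove_stopwords(["New York", "The", "a"])
# => ["New York"]
```

Note that String#match? requires Ruby 2.4 or later.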

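You could try using Array's set operations, but you'll run afoul of the case sensitivity of the words, which forces you to iterate over both the stopwords and the arrays, and that will run a lot slower.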
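To make the case-sensitivity problem concrete, here is a small sketch. Plain array difference misses "TO" because it compares strings exactly, so a case-insensitive version has to downcase every term before the lookup. The names STOPWORD_SET and reject_stopwords are hypothetical, and whether this ends up slower than the regex for your data is something to benchmark rather than assume.

```ruby
require 'set'

# Plain array difference compares exactly, so "TO" survives.
%w[Things TO Be] - %w[to]
# => ["Things", "TO", "Be"]

# Hypothetical sample list, stored in lowercase for case-insensitive lookups.
STOPWORD_SET = Set.new(%w[to and or the a]).freeze

# Downcases each term before checking membership in the set.
def reject_stopwords(terms)
  terms.reject { |term| STOPWORD_SET.include?(term.downcase) }
end

reject_stopwords(%w[14000 Things TO Be Happy About])
# => ["14000", "Things", "Be", "Happy", "About"]
```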

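Take a look at these two answers for some additional tips on how to build very powerful patterns that make it easy to match thousands of strings: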

"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"

Answer 2

All you need is a list of English stopwords. You can find one here, or by searching Google for 'English stopwords list'.
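Once you have such a list, using it could look roughly like the sketch below. It assumes the list is plain text with one word per line and optional '#' comment lines; that format, the parse_stopwords helper and the tiny inline sample are all assumptions for illustration, not a real published list. In practice the text would more likely come from something like File.read on a file you downloaded.

```ruby
require 'set'

# Assumed format: one word per line, blank lines and '#' comments ignored.
# This is a tiny sample, not a complete stopword list.
SAMPLE_STOPWORD_TEXT = <<~LIST
  # sample English stopwords
  a
  and
  or
  the
  to
LIST

# Parses the text into a Set of lowercase words for fast lookups.
def parse_stopwords(text)
  words = text.each_line
              .map(&:strip)
              .reject { |line| line.empty? || line.start_with?('#') }
  Set.new(words.map(&:downcase))
end

stopwords = parse_stopwords(SAMPLE_STOPWORD_TEXT)

cleaned = %w[14000 Things to Be Happy About].reject do |term|
  stopwords.include?(term.downcase)
end
# => ["14000", "Things", "Be", "Happy", "About"]
```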

Removing excess junk words from a string or array of strings
