Question

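If I have a string with no spaces in it, just a concatenation like "hellocarworld", I want to get back an array of the largest dictionary words. So I would get ['hello','car','world']. I would not get back words such as 'a', because that belongs to 'car'.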

The dictionary words can come from anywhere, such as the dictionary on unix:

words = File.readlines("/usr/share/dict/words").collect{|x| x.strip}

string= "thishasmanywords"

How would you go about doing this?

Answer 1

I suggest the following.

Code

For a given string and dictionary, dict:

string_arr = string.chars
string_arr.size.downto(1).with_object([]) { |n,arr|
  string_arr.each_cons(n) { |a|
    word = a.join
    arr << word if (dict.include?(word) && !arr.any? {|w| w.include?(word) })}}

Examples

dict = File.readlines("/usr/share/dict/words").collect{|x| x.strip}

string = "hellocarworld"
  #=> ["hello", "world", "loca", "car"]

string= "thishasmanywords"
  #=> ["this", "hish", "many", "word", "sha", "sma", "as"]

"loca" is the plural of "locus". I had never heard of "hish", "sha" or "sma". They all appear to be slang words, as I could only find them in something called the "Urban Dictionary".

Explanation

string_arr = "hellocarworld".chars
  #=> ["h", "e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l", "d"]
string_arr.size 
  #=> 13

So for this string we have:

13.downto(1).with_object([]) { |n,arr|...

where arr is an initially empty array that will be built up and returned. For n => 13,

enum = string_arr.each_cons(13)
  #<Enumerator: ["h","e","l","l","o","c","a","r","w","o","r","l","d"]:each_cons(13)>

The enumerator yields a single element, the array string_arr itself:

enum.size                #=> 1
enum.first == string_arr #=> true

That single array is assigned to the block variable a, so we obtain:

word = enum.first.join
  #=> "hellocarworld"

We find that

dict.include?(word) #=> false

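so this word is not added to the array arr. Had it been in the dictionary, we would have checked that it is not a substring of any word already in arr. All words already in arr are the same length or longer, because lengths are tried from longest to shortest.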

We next compute:

enum = string_arr.each_cons(12)
  #<Enumerator: ["h","e","l","l","o","c","a","r","w","o","r","l","d"]:each_cons(12)>

which we can see enumerates two arrays:

enum = string_arr.each_cons(12).to_a
  #=> [["h", "e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l"],
  #    ["e", "l", "l", "o", "c", "a", "r", "w", "o", "r", "l", "d"]]

corresponding to the words:

enum.first.join #=> "hellocarworl"
enum.last.join  #=> "ellocarworld"

Neither of these is in the dictionary. We continue in this fashion until we reach n => 1:

string_arr.each_cons(1).to_a
  #=> [["h"], ["e"], ["l"], ["l"], ["o"], ["c"],
  # ["a"], ["r"], ["w"], ["o"], ["r"], ["l"], ["d"]]

We find only "a" in the dictionary, but since it is a substring of "loca" and "car", which are already elements of the array arr, we do not add it.
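The walkthrough above can be reproduced without the system dictionary. The following is a minimal sketch that wraps the same logic in a method; the method name longest_words and the tiny sample_dict are assumptions made for illustration, and a Set is swapped in for the array only to make lookups faster.

```ruby
require "set"

# Same approach as above: try every substring, longest first, and keep a
# dictionary word only if it is not contained in a word already kept.
# `longest_words` and `sample_dict` are hypothetical names for this sketch.
def longest_words(string, dict)
  chars = string.chars
  chars.size.downto(1).with_object([]) do |n, arr|
    chars.each_cons(n) do |a|
      word = a.join
      arr << word if dict.include?(word) && arr.none? { |w| w.include?(word) }
    end
  end
end

sample_dict = Set.new(%w[a car hello loca world])
p longest_words("hellocarworld", sample_dict)
  #=> ["hello", "world", "loca", "car"]
```

Note that the result can contain overlapping words ("loca" and "car" share the letters "ca"), so it is not a partition of the input string.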

Answer 2

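Starting at the beginning of the input string, find the longest word that is in the dictionary. Chop that word off the beginning of the input string and repeat.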

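Once the input string is empty, you are done. If the string is not empty but no word was found, remove the first character and continue the process.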
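The steps described in this answer could be sketched in Ruby roughly as follows. The method name greedy_split and the small sample_dict are assumptions made for illustration; a real run would load the dictionary from a word list instead.

```ruby
require "set"

# Greedy longest-prefix split: take the longest dictionary word at the
# front of the string, chop it off, and repeat.
# `greedy_split` and `sample_dict` are hypothetical names for this sketch.
def greedy_split(string, dict)
  max_len = dict.map(&:length).max || 0
  words = []
  rest = string.dup
  until rest.empty?
    # Try the longest candidate prefix first, shrinking one character at a time.
    limit = [max_len, rest.length].min
    len = limit.downto(1).find { |n| dict.include?(rest[0, n]) }
    if len
      words << rest[0, len]
      rest = rest[len..-1]
    else
      # No dictionary word starts here: drop one character and continue.
      rest = rest[1..-1]
    end
  end
  words
end

sample_dict = Set.new(%w[a as car hell hello has many this word words world])
p greedy_split("hellocarworld", sample_dict)
  #=> ["hello", "car", "world"]
p greedy_split("thishasmanywords", sample_dict)
  #=> ["this", "has", "many", "words"]
```

Because the choice at each position is greedy, a long word taken early can prevent a better split later in the string; handling that would need backtracking or dynamic programming, which this sketch does not attempt.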

Answer 3

This can be a bit tricky if you're not familiar with the technique. I often lean heavily on regular expressions for this:

words = File.readlines("/usr/share/dict/words").collect(&:strip).reject(&:empty?)
regexp = Regexp.new(words.sort_by(&:length).reverse.join('|'))

phrase = "hellocarworld"

equiv = [ ]

while (m = phrase.match(regexp)) do
  phrase.gsub!(m[0]) do
   equiv << m[0]
   '*'
  end
end

equiv
# => ["hello", "car", "world"]

Update: empty strings are now stripped out of the word list, since they caused the while loop to run forever.
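As a possible variation on the same idea, the pattern can be built with Regexp.union, which escapes any regex metacharacters in the words, and the matches can be collected with String#scan instead of an explicit loop. This is only a sketch: sample_words is a made-up word list standing in for the real dictionary, and characters that match no word are silently skipped.

```ruby
# `sample_words` is a hypothetical stand-in for the dictionary file.
sample_words = %w[a as car has hello many this word words world]

# Longest alternatives first, so "words" is preferred over "word"
# when both could match at the same position.
pattern = Regexp.union(sample_words.sort_by { |w| -w.length })

p "hellocarworld".scan(pattern)
# => ["hello", "car", "world"]
p "thishasmanywords".scan(pattern)
# => ["this", "has", "many", "words"]
```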

How do I find all of the longest words in a string?
