Question

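This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break text down into substrings such that each substring is found in the dictionary?

For example, you can break "thisisatext" into ["this", "is", "a", "text"] using /usr/share/dict/words.

I believe backtracking can solve this problem (in pseudo-Java):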

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}

List<String> solution = new List<String>()
solve(text, dict, solution)

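Does this make sense? Would you optimize the step that searches the dictionary for prefixes? What data structures would you recommend?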
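For reference, here is a minimal runnable sketch of the backtracking idea in Python. It assumes the dictionary is a plain `set` of strings and returns the first segmentation it finds (or `None`), whereas the pseudo-Java leaves the return value open. The tuple-based `solution` parameter is an illustrative choice, not part of the original.

```python
def solve(s, dictionary, solution=()):
    """Return one segmentation of s into dictionary words, or None."""
    if not s:
        return list(solution)          # whole text consumed: report the chosen words
    for end in range(1, len(s) + 1):
        prefix = s[:end]
        if prefix in dictionary:       # average O(1) membership test on a set
            result = solve(s[end:], dictionary, solution + (prefix,))
            if result is not None:
                return result
    return None                        # no prefix works here: backtrack


words = {"this", "is", "a", "text", "ate"}
print(solve("thisisatext", words))     # ['this', 'is', 'a', 'text']
```

Without memoization this can take exponential time on inputs with many overlapping dictionary words.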

Answer 1

There is a very thorough write-up of the solution to this problem in this blog post.

The basic idea is simply to memoize the function you wrote, which gives you an O(n^2) time, O(n) space algorithm.
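A hedged sketch of what memoizing the function could look like in Python. The names `segment` and `next_cut` are made up for illustration. The memo stores one entry per start index, which keeps the extra space at O(n). The O(n^2) bound counts each dictionary lookup as one step, and slicing adds a further cost on top of that.

```python
def segment(text, dictionary):
    """Return one segmentation of text into dictionary words, or None."""
    n = len(text)
    next_cut = {}   # memo: start index -> end of the first word of a valid split, or None

    def solve(start):
        if start == n:
            return True
        if start not in next_cut:      # each start index is explored only once
            next_cut[start] = None
            for end in range(start + 1, n + 1):
                if text[start:end] in dictionary and solve(end):
                    next_cut[start] = end
                    break
        return next_cut[start] is not None

    if not solve(0):
        return None
    # Follow the recorded cut points to rebuild the list of words.
    words, pos = [], 0
    while pos < n:
        words.append(text[pos:next_cut[pos]])
        pos = next_cut[pos]
    return words


print(segment("thisisatext", {"this", "is", "a", "text", "ate"}))
```

For very long texts the recursion depth may exceed Python's default limit, in which case an iterative, bottom-up version would be the safer choice.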

Answer 2

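This solution assumes that a Trie data structure exists for the dictionary. In addition, each node in the Trie is assumed to provide the following functions:

node.IsWord(): returns true if the path to this node is a word
node.IsChild(char x): returns true if a child with label x exists
node.GetChild(char x): returns the child with label x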

def annotate(str, start, end, root, node):
    # Walk the trie from position start; whenever the path spells a word,
    # record in root where that word began.
    i = start
    while i <= end:
        if node.IsChild(str[i]):
            node = node.GetChild(str[i])
            if node.IsWord():
                root[i + 1] = start
            i += 1
        else:
            break

end = len(str) - 1
root = [-1 for i in range(len(str) + 1)]
for start in range(0, end + 1):
    # Only start a word at the beginning or where a previous word ended.
    if start == 0 or root[start] >= 0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

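I'll leave out the part that lists the words making up the string by traversing root backwards.

The time complexity is O(nk), where n is the length of the string and k is the length of the longest word in the dictionary.

PS: I am assuming the dictionary contains the following words: this, is, a, text, ate.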
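The part the answer leaves out, walking root backwards, can be sketched in Python as follows. `TrieNode`, `build_trie`, `compute_root` and `list_words` are hypothetical names chosen for this sketch, and the dict-based trie is just one possible way to provide the IsWord / IsChild / GetChild operations assumed by the answer.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # label -> child node
        self.is_word = False   # True if the path to this node spells a word


def build_trie(words):
    trie_root = TrieNode()
    for word in words:
        node = trie_root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return trie_root


def compute_root(text, trie_root):
    n = len(text)
    root = [-1] * (n + 1)      # root[j] = start of a word that ends just before index j
    for start in range(n):
        if start == 0 or root[start] >= 0:
            node = trie_root
            for i in range(start, n):
                node = node.children.get(text[i])
                if node is None:
                    break
                if node.is_word:
                    root[i + 1] = start
    return root


def list_words(text, root):
    end = len(text)
    if end > 0 and root[end] < 0:
        return None            # no sequence of words reaches the end of the text
    words = []
    while end > 0:             # hop from each word's end back to its start
        start = root[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


text = "thisisatext"
root = compute_root(text, build_trie(["this", "is", "a", "text", "ate"]))
print(root)                    # [-1, -1, -1, -1, 0, -1, 4, 6, -1, 6, -1, 7]
print(list_words(text, root))  # ['this', 'is', 'a', 'text']
```

Every non-negative entry in root is written only from a start position that is itself reachable, so the backward walk always ends at index 0.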

Answer 3

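Approach 1 - A Trie looks like a very good fit here. Generate a trie of the words in an English dictionary. Building it is a one-time cost. Once it is built, your string can easily be compared letter by letter. If at any point you encounter a leaf, you can assume you have found a word; add it to a list and continue your traversal. Keep traversing until you reach the end of your string. The list is the output.

Time complexity for search - O(word_length).

Space complexity - O(charsize * word_length * no_words), i.e. the size of your dictionary.

Approach 2 - I have heard of Suffix Trees but never used them; they might be useful here.

Approach 3 - A more pedantic and lousy alternative. You have already suggested this.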

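You could try the other way around. Run through the dict and check for a substring match. Here I am assuming the keys in dict are the words of the English dictionary /usr/share/dict/words. So the pseudocode looks something like this -

(list) splitIntoWords(String str, dict d)
{
    words = []
    for (word in d)
    {
        if word in str
            words.append(word);
    }
    return words;
}

Complexity - O(n) to run through the entire dict + O(1) for the substring match.

Space - O(n) in the worst case.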

As others have pointed out, this does require backtracking.
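A small Python sketch of Approach 3, under the assumption that the dictionary is any iterable of strings; `words_in_text` is an illustrative name. It only collects candidate words, so the result would still have to be passed to a backtracking or memoized search to obtain an actual segmentation.

```python
def words_in_text(text, dictionary):
    """Collect every dictionary word that occurs somewhere in text."""
    return [word for word in dictionary if word in text]


dictionary = ["this", "is", "a", "text", "ate", "dog"]
print(words_in_text("thisisatext", dictionary))
# ['this', 'is', 'a', 'text', 'ate']
```

Note that Python's `in` operator on strings scans the text, so the cost of each check grows with the length of the text.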

Answer 4

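You can solve this problem using Dynamic Programming and Hashing.

Calculate the hash value of every word in the dictionary, using your favorite hash function. I would use something like (a1 * B^(n - 1) + a2 * B^(n - 2) + ... + an * B^0) % P, where a1a2...an is a string, n is the length of the string, B is the base of the polynomial and P is a large prime number. If you have the hash value of a string a1a2...an, you can calculate the hash value of the string a1a2...ana(n+1) in constant time: (hashValue(a1a2...an) * B + a(n+1)) % P.

The complexity of this part is O(N * M), where N is the number of words in the dictionary and M is the length of the longest word in the dictionary.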
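As an illustration, the polynomial hash and its constant-time extension might look like this in Python. The concrete values of `B` and `P` are assumptions picked for the sketch; the answer does not prescribe them.

```python
B = 257                # base of the polynomial (assumed value)
P = 1_000_000_007      # a large prime modulus (assumed value)


def hash_value(s):
    """(a1*B^(n-1) + a2*B^(n-2) + ... + an*B^0) % P, computed left to right."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % P
    return h


def extend_hash(h, ch):
    """Hash of a string plus one more character, from the hash of the string alone."""
    return (h * B + ord(ch)) % P


# Extending the hash of "tex" by "t" gives the hash of "text".
print(extend_hash(hash_value("tex"), "t") == hash_value("text"))   # True
```

Hashing every dictionary word this way touches each character once, which matches the O(N * M) bound stated for this step.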

Then, use a DP function like this:

   bool vis[LENGTH_OF_STRING];
   bool go(char str[], int length, int position)
   {
      int i;

      // You found a set of words that solves the task.
      if (position == length) {
          return true;
      }

      // This position was already visited without success, so it cannot succeed this time either.
      if (vis[position]) {
         return false;
      }
      // Mark this position as visited.
      vis[position] = true;

      // A possible improvement is to stop this loop when the length of substring(position, i) is greater than the length of the longest word in the dictionary.
      for (i = position; i < length; i++) {
         // Calculate the hash value of the substring str(position, i);
         if (hashValue is in dict) {
            // Check whether the substring str(i + 1, length) can be partitioned into words in the dictionary.
            if (go(str, length, i + 1)) {
               // Use the corresponding word for hashValue in the given position and return true because you found a partition for the substring str(position, length).
               return true;
            }
         }
      }

      return false;
   }

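The complexity of this algorithm is O(N * M), where N is the length of the string and M is the length of the longest word in the dictionary, or O(N^2), depending on whether or not you coded the improvement.

So the total complexity of the algorithm is O(N1 * M) + O(N2 * M) (or O(N2^2)), where N1 is the number of words in the dictionary, M is the length of the longest word in the dictionary and N2 is the length of the string.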
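Putting the hashing and the visited array together, a runnable Python sketch could look like the following. `can_split` and its `base` / `mod` defaults are assumptions made for illustration. As in the answer, it compares hash values only, so it relies on the hash function producing no collisions for the inputs at hand.

```python
def can_split(text, dictionary, base=257, mod=1_000_000_007):
    """Return True if text can be partitioned into dictionary words."""
    def hash_value(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Assumption: no two distinct strings of interest share a hash value.
    dict_hashes = {hash_value(word) for word in dictionary}
    max_len = max((len(word) for word in dictionary), default=0)
    n = len(text)
    vis = [False] * n

    def go(position):
        if position == n:
            return True                # reached the end: a partition exists
        if vis[position]:
            return False               # already tried from here without luck
        vis[position] = True
        h = 0
        # The improvement: never look further than the longest dictionary word.
        for i in range(position, min(n, position + max_len)):
            h = (h * base + ord(text[i])) % mod   # extend the hash by one character
            if h in dict_hashes and go(i + 1):
                return True
        return False

    return go(0)


words = {"this", "is", "a", "text", "ate"}
print(can_split("thisisatext", words))   # True
print(can_split("thisisatexx", words))   # False
```

Because each position is marked in `vis` before its loop runs, every position is expanded at most once, and each expansion examines at most `max_len` substrings.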

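If you can't think of a good hash function (one without any collisions), another possible solution is to use Tries or a Patricia trie (if the size of a normal trie is very large) (I couldn't post links to these topics because my reputation is not high enough to post more than 2 links). With that approach, however, the complexity of the algorithm will be O(N * M) * O(time needed to find a word in the trie), where N is the length of the string and M is the length of the longest word in the dictionary.

I hope this helps, and I apologize for my poor English.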

How to break a given text into words from a dictionary?
