I'm trying to collapse the common substrings in a list of sentences and show only the regions where they differ. So take this:
Please don't kick any of the cats
Please do kick any of the cats
Please don't kick any of the dogs
Please do kick any of the dogs
Please don't kick any of the garden snakes
Please do pet any of the garden snakes
and return:
Please [don't|do] [kick|pet] any of the [cats|dogs|garden snakes]
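I'm looking for help with an algorithm. I believe this is a variant of the LCS problem, and I think it involves some kind of suffix tree processing. An explanation plus pseudocode for an implementation would be ideal.
Another example: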
Please join thirteen of your friends at the Midnight Bash this Friday
Don't forget to join your friend John at the Midnight Bash tomorrow
Don't forget to join your friends John and Julie at the Midnight Bash tonight
becomes:
[Please|Don't forget to]
join
[thirteen of your friends|your friend John|your friends John and Julie]
at the Midnight Bash
[this Friday|tomorrow|tonight]
How about this approach...
for an array of sentences
    loop with the remaining sentence
        find the "first common substring (FCS)"
        split the sentences on the FCS
        every unique phrase before the FCS is part of the set of uncommon phrases
        trim the sentence by the first uncommon phrase
    end loop
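The loop above could be sketched in Python roughly as follows. This is only one reading of the pseudocode, with several assumptions: sentences are split on whitespace, the "first common substring" is taken to be the earliest word of the first sentence that appears in every remaining sentence (extended while all sentences keep agreeing), and the names `first_common_word`, `unique_phrases` and `collapse` are made up for illustration.

```python
def first_common_word(sentences):
    """Earliest word of the first sentence that occurs in every sentence, or None."""
    first, rest = sentences[0], sentences[1:]
    for word in first:
        if all(word in other for other in rest):
            return word
    return None


def unique_phrases(token_lists):
    """Render the distinct phrases as "[a|b|c]", keeping first-seen order."""
    phrases = dict.fromkeys(" ".join(tokens) for tokens in token_lists)
    return "[" + "|".join(phrases) + "]"


def collapse(lines):
    remaining = [line.split() for line in lines]
    parts = []
    while any(remaining):
        word = first_common_word(remaining)
        if word is None:
            # Nothing is shared any more: the leftovers form one last slot.
            parts.append(unique_phrases(remaining))
            break
        # Split every sentence on the first occurrence of the common word.
        cuts = [tokens.index(word) for tokens in remaining]
        prefixes = [tokens[:cut] for tokens, cut in zip(remaining, cuts)]
        if any(prefixes):
            parts.append(unique_phrases(prefixes))
        remaining = [tokens[cut:] for tokens, cut in zip(remaining, cuts)]
        # Extend the common run while every sentence has the same next word.
        run = []
        while all(remaining) and len({tokens[0] for tokens in remaining}) == 1:
            run.append(remaining[0][0])
            remaining = [tokens[1:] for tokens in remaining]
        parts.append(" ".join(run))
    return " ".join(parts)


cats = [
    "Please don't kick any of the cats",
    "Please do kick any of the cats",
    "Please don't kick any of the dogs",
    "Please do kick any of the dogs",
    "Please don't kick any of the garden snakes",
    "Please do pet any of the garden snakes",
]
print(collapse(cats))
# Please [don't kick|do kick|do pet] any of the [cats|dogs|garden snakes]
```

Note that this does not reproduce the desired output exactly: because "kick" is missing from the last sentence, the first slot comes out as whole phrases rather than as two separate slots [don't|do] [kick|pet]. Splitting a slot further would need an extra pass over the uncommon phrases.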
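Answer 0: (score: 0)
Map each unique word to a single object. Then build a conditional probability table (see Markov chains) that counts how many times a word follows each sequence.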
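A minimal sketch of such a table in Python, assuming a first-order chain (the "sequence" is just the single preceding word) and whitespace tokenization. The function name `follower_counts` is hypothetical, and raw counts are stored rather than normalized probabilities.

```python
from collections import Counter, defaultdict


def follower_counts(lines):
    """Map each word to a Counter of the words that directly follow it."""
    table = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for previous, following in zip(words, words[1:]):
            table[previous][following] += 1
    return table


cats = [
    "Please don't kick any of the cats",
    "Please do kick any of the cats",
    "Please don't kick any of the dogs",
    "Please do kick any of the dogs",
    "Please don't kick any of the garden snakes",
    "Please do pet any of the garden snakes",
]
table = follower_counts(cats)
print(dict(table["Please"]))  # {"don't": 3, 'do': 3}
print(dict(table["do"]))      # {'kick': 2, 'pet': 1}
```

Words with exactly one follower (such as "of", always followed by "the") sit on a shared stretch, while words with several followers mark a point where the sentences branch. Turning those branch points into bracketed groups is left open here.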
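Answer 1: (score: -1)
Interesting. A long time ago I was thinking about building something like this myself, until I realized it is really more of an artificial intelligence problem. There are too many factors to consider: grammar, syntax, context, errors, and so on. But if your input is always as fixed as "Please [A1|A2|..] [B1|B2|..] any of the [C1|C2|..]", then a simple regular expression pattern would probably do: "^Please\s*(do(?:n't)?)\s*(\w+)\s*any of the\s*(.*)$".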
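For illustration, here is how a fixed-template regex of this kind might be applied with Python's `re` module. The pattern is a reconstruction with three plain capture groups, so treat it as an assumption rather than the exact expression from this answer. `collapse_fixed` is a made-up helper name.

```python
import re

# Assumed template: "Please <do|don't> <verb> any of the <object>"
PATTERN = re.compile(r"^Please\s*(do(?:n't)?)\s*(\w+)\s*any of the\s*(.*)$")


def collapse_fixed(lines):
    """Collect the distinct values seen in each of the three slots."""
    slots = [[], [], []]
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            raise ValueError(f"line does not fit the template: {line!r}")
        for slot, value in zip(slots, match.groups()):
            if value not in slot:
                slot.append(value)
    first, second, third = ("[" + "|".join(slot) + "]" for slot in slots)
    return f"Please {first} {second} any of the {third}"


cats = [
    "Please don't kick any of the cats",
    "Please do kick any of the cats",
    "Please don't kick any of the dogs",
    "Please do kick any of the dogs",
    "Please don't kick any of the garden snakes",
    "Please do pet any of the garden snakes",
]
print(collapse_fixed(cats))
# Please [don't|do] [kick|pet] any of the [cats|dogs|garden snakes]
```

This only works because the template is hard-coded. The second example in the question (the Midnight Bash sentences) would not match it at all.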