Question

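I have a large amount of text and I am trying to discover the templates that occur most often in it. I am thinking of using an N-gram approach, which was in fact suggested as a solution in this question as well, but my requirement is slightly different. To clarify, I have some text like this: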

I wake up every day morning and read the newspaper and then go to work
I wake up every day morning and eat my breakfast and then go to work
I am not sure that this is the solution but I will try
I am not sure that this is the answer but I will try
I am not feeling well today but I will get the work done and deliver it tomorrow
I was not feeling well yesterday but I will get the work done and let you know by tomorrow

I am trying to extract "templates" like these:

I wake up every day morning and ... and then go to work
I am not sure that this is the ... but I will try
I ... not feeling well ... but I will get the work done and ... tomorrow

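I am looking for an approach that scales to millions of lines of text, so I am wondering whether the same N-gram approach can be adapted to this problem, or whether there are other options.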

Answer 1

Millions of lines of text is not a very large number :)

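What you are looking for is at least similar to collocation discovery. You could try computing pointwise mutual information (PMI) over n-grams. For this and other approaches to the problem, see Manning & Schütze (1999).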
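As a rough illustration of the PMI idea, here is a minimal sketch for bigrams, using the definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The function name `bigram_pmi`, the `min_count` cutoff, and the whitespace tokenization are assumptions made for this example, not something taken from the book. It only scores adjacent word pairs; turning high-scoring n-grams into templates with gaps would need further work.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=2):
    """Return {(w1, w2): PMI} for bigrams seen at least min_count times.

    Hypothetical helper for illustration: lowercases and splits on
    whitespace, which is a simplification of real tokenization.
    """
    unigrams = Counter()
    bigrams = Counter()
    for line in lines:
        tokens = line.lower().split()
        unigrams.update(tokens)
        # Pair each token with its successor within the same line.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


sample = [
    "I wake up every day morning and read the newspaper and then go to work",
    "I wake up every day morning and eat my breakfast and then go to work",
    "I am not sure that this is the solution but I will try",
    "I am not sure that this is the answer but I will try",
    "I am not feeling well today but I will get the work done and deliver it tomorrow",
    "I was not feeling well yesterday but I will get the work done and let you know by tomorrow",
]

scores = bigram_pmi(sample)
# Print the bigrams with the strongest association first.
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(pair, round(pmi, 2))
```

Pairs whose words almost always appear together (such as "wake up") score higher than pairs built from frequent words (such as "i will"). For millions of lines the same counting fits in a single pass with `Counter`, though the `min_count` threshold would probably need to be raised.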
