Generating possible sentences from a jumbled list of n-grams (Python)

Question

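Sample input stream: [ ('t','h'), ('h','e'), ('e', ' '), (' ','f') , ('f','o'), ('o','x'), ('x',' '), (' ','a'), ('a','t'), ('t','e'), ('e', <p>) ]

Suppose you have a sentence {ABCABA}, where each letter is either a character or a word, depending on tokenization.

Then your bag of bigrams is {(AB), (BC), (CA), (AB), (BA)}.

From here, I need an algorithm that lists all possible sentences of the same length as the original sentence, given these bigrams. Here, {ABCABA} (the original sequence) and {ABABCA} are both valid possible sentences, but {ACBABA} is not. This example uses bigrams, but I also need it to work for any $n$. Any ideas?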
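To make the validity criterion concrete, here is a minimal sketch that compares n-gram multisets. The helper names `ngram_bag` and `is_valid_sentence` are hypothetical and not part of the question; the sketch assumes that "valid" means the candidate yields exactly the same bag of n-grams as the original.

```python
from collections import Counter


def ngram_bag(tokens, n):
    """Return the multiset (bag) of n-grams found in tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def is_valid_sentence(candidate, original, n=2):
    """A candidate is valid if its n-gram bag equals the original's bag."""
    return ngram_bag(candidate, n) == ngram_bag(original, n)


print(is_valid_sentence("ABABCA", "ABCABA"))  # True
print(is_valid_sentence("ACBABA", "ABCABA"))  # False
```

Under this reading, ABABCA passes because it contains AB twice plus BA, BC and CA, while ACBABA introduces AC and CB, which are not in the bag.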

Answer 1

Build a directed graph, then use recursion to enumerate all possible paths of length k. To wit:

def buildgraph(input, n):
    # n-1-gram to tokens that follow it
    graph = {
        tuple(input[i:(i + n - 1)]): set()
        for i in range(len(input) - n + 1)
    }
    for i in range(len(input) - n + 1):
        graph[tuple(input[i:(i + n - 1)])].add(input[i + n - 1])
    return graph


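def continuations(graph, n, k, pathsofar):
    # Extend pathsofar one token at a time until it holds k tokens.
    if len(pathsofar) == k:
        yield pathsofar
    elif len(pathsofar) < k:
        # .get avoids a KeyError when the trailing (n-1)-gram has no recorded successor.
        for token in graph.get(pathsofar[-(n - 1):], ()):
            yield from continuations(graph, n, k, pathsofar + (token, ))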


def allsentences(input, n, k):
    graph = buildgraph(input, n)
    for ngram in graph:
        yield from continuations(graph, n, k, ngram)


for sent in allsentences('abcaba', 2, 6):
    print(''.join(sent))
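As far as I can tell, the graph walk above stores successors in sets, so it can reuse the same n-gram any number of times (for 'abcaba' it would also produce 'ababab'). If each n-gram in the bag should be consumed exactly once, one possible variation is a backtracking search over a `Counter`. This is a sketch under that assumption, not the answerer's method, and `sentences_from_bag` is a hypothetical name.

```python
from collections import Counter


def sentences_from_bag(ngrams):
    """Yield every token tuple whose n-gram multiset equals the given bag."""
    bag = Counter(tuple(g) for g in ngrams)
    if not bag:
        return
    n = len(next(iter(bag)))      # assumes all n-grams share one length
    total = sum(bag.values())     # number of n-grams that must be consumed

    def extend(path, used):
        if used == total:
            yield path
            return
        # The next n-gram must start with the last n-1 tokens of the path.
        suffix = path[len(path) - (n - 1):]
        for gram in list(bag):
            if bag[gram] > 0 and gram[:n - 1] == suffix:
                bag[gram] -= 1                    # consume one copy
                yield from extend(path + gram[-1:], used + 1)
                bag[gram] += 1                    # backtrack

    for first in list(bag):
        bag[first] -= 1
        yield from extend(first, 1)
        bag[first] += 1


for sent in sentences_from_bag(['AB', 'BC', 'CA', 'AB', 'BA']):
    print(''.join(sent))
```

Because the search iterates over distinct n-grams rather than over individual copies, each sentence is produced once even when the bag contains duplicates such as AB.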

Answer 2

Here is a very simple solution. First, compute all the n-grams; second, get all possible sublists of those n-grams and all permutations of those sublists.

n-grams (may be optional, given the sample input stream)

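You can use a list comprehension. Take the n suffixes of the sentence that start at offsets 0 to n-1: [sentence[k:] for k in range(n)]. For ABCABA and n = 3, you get [ABCABA, BCABA, CABA]. You just have to zip those lists and join the resulting tuples (note the star, which unpacks the arguments):

def ngrams(sentence, n):
    return ["".join(t) for t in zip(*[sentence[k:] for k in range(n)])]

This gives:

>>> ng = ngrams("ABCABA", 2)
>>> ng
['AB', 'BC', 'CA', 'AB', 'BA']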
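The question allows each symbol to be a word rather than a character, in which case joining with an empty string would blur token boundaries. A minimal sketch of the same zip idea that keeps each n-gram as a tuple follows; `token_ngrams` is a hypothetical name, not part of the answer.

```python
def token_ngrams(tokens, n):
    """Return the n-grams of a token sequence as tuples (works for str or list)."""
    return list(zip(*(tokens[k:] for k in range(n))))


print(token_ngrams("the fox ate".split(), 2))
print(token_ngrams("ABCABA", 3))
```

Keeping tuples means the same function serves both character-level and word-level tokenization, at the cost of an explicit join when printing.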

List the sentences

You can use itertools, specifically combinations and permutations. The combinations function gives the "r length subsequences of elements from the input iterable":

>>> import itertools
>>> list(itertools.combinations(ng, 2))
[('AB', 'BC'), ('AB', 'CA'), ('AB', 'AB'), ('AB', 'BA'), ('BC', 'CA'), ('BC', 'AB'), ('BC', 'BA'), ('CA', 'AB'), ('CA', 'BA'), ('AB', 'BA')]

You have to take the combinations for every possible length. The permutations function will then permute all of those subsequences:

def sentences(sentence, n):
    ng = ngrams(sentence, n)
    for k in range(len(ng)):
        for c in itertools.combinations(ng, k):
            for p in itertools.permutations(c):
                yield "".join(p)

Or with a generator comprehension:

def sentences(sentence, n):
    ng = ngrams(sentence, n)
    return ("".join(p) for k in range(len(ng)) for c in itertools.combinations(ng, k) for p in itertools.permutations(c))

This gives 206 possibilities:

>>> list(sentences("ABCABA", 2))
['', 'AB', 'BC', 'CA', 'AB', 'BA', 'ABBC', 'BCAB', ..., 'ABBACABC', 'BABCCAAB', 'BABCABCA', 'BACABCAB', 'BACAABBC', 'BAABBCCA', 'BAABCABC']
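The figure of 206 can be checked arithmetically: with 5 bigrams and k running over range(5), the number of ordered selections is the sum of 5!/(5-k)! for k = 0..4, that is 1 + 5 + 20 + 60 + 120. The sketch below assumes Python 3.8+ for `math.perm` and hard-codes the bigram list from the example.

```python
import itertools
import math

ng = ['AB', 'BC', 'CA', 'AB', 'BA']

# Closed form: k-permutations of 5 items, summed over k = 0..4.
expected = sum(math.perm(len(ng), k) for k in range(len(ng)))

# Brute force: count what the combinations/permutations loops generate.
actual = sum(
    1
    for k in range(len(ng))
    for c in itertools.combinations(ng, k)
    for p in itertools.permutations(c)
)

print(expected, actual)  # 206 206
```

The count treats the two copies of AB as distinct items, so the generated list contains repeated strings, and range(len(ng)) stops before k = 5, so no result uses all five bigrams.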

