How to generate random paragraphs with NLTK

Time: 2012-12-13 10:19:56

Tags: python nltk

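I'm trying to build a test unit to stress-test a very large publishing-management implementation. I was thinking of using NLTK to generate paragraphs with varied content for the articles, along with random titles.

Does NLTK have the capability to do something like this? I'd like each article to be different so that I can test different layout sizes. I'd also like to do the same for the topics.

P.S. I need to generate more than 1 million articles, which will eventually be used to test many things (performance, search, layout, etc.).

Can anyone offer advice?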

1 answer:

Answer 0: (score: 5)

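I have used this. It takes phrases from Noam Chomsky and generates random paragraphs. You can change the raw material text to anything you like. Of course, the more text you use, the better.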

# List of LEADINs to buy time.
leadins = """To characterize a linguistic level L,
        On the other hand,
        This suggests that
        It appears that
        Furthermore """

# List of SUBJECTs chosen for maximum professorial macho.
subjects = """ the notion of level of grammaticalness
        a case of semigrammaticalness of a different sort
        most of the methodological work in modern linguistics
        a subset of English sentences interesting on quite independent grounds
        the natural general principle that will subsume this case """

# List of VERBs chosen for autorecursive obfuscation.
verbs = """can be defined in such a way as to impose
        delimits
        suffices to account for
        cannot be arbitrary in
        is not subject to """


# List of OBJECTs selected for profound sententiousness.

objects = """ problems of phonemic and morphological analysis.
        a corpus of utterance tokens upon which conformity has been defined by the paired utterance test.
        the traditional practice of grammarians.
        the levels of acceptability from fairly high (e.g. (99a)) to virtual gibberish (e.g. (98d)).
        a stipulation to place the constructions into these various categories.
        a descriptive fact.
        a parasitic gap construction."""

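import random
import textwrap
from itertools import chain, islice

def chomsky(times=1, line_length=72):
    """Return `times` random sentences, wrapped to `line_length` columns."""
    parts = []
    for part in (leadins, subjects, verbs, objects):
        # One phrase per line; strip the indentation left by the triple-quoted strings.
        phraselist = [line.strip() for line in part.splitlines()]
        random.shuffle(phraselist)
        parts.append(phraselist)
    # zip() pairs one phrase from each shuffled list; islice() keeps the first
    # `times` sentences (at most as many as the shortest list provides).
    output = chain.from_iterable(islice(zip(*parts), 0, times))
    return textwrap.fill(' '.join(output), line_length)

print(chomsky())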

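This returned for me:

This suggests that a case of semigrammaticalness of a different sort is
not subject to a corpus of utterance tokens upon which conformity has
been defined by the paired utterance test.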
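Note that because the shuffled phrase lists are zipped together, `chomsky()` can return at most as many sentences as the shortest list holds (five here), however large `times` is. A minimal sketch of one way around that, assuming it is acceptable for phrases to repeat, is to draw each slot independently with `random.choice`. The names `PHRASES`, `random_sentence`, and `random_paragraph` are illustrative and not part of the original answer, and the phrase lists are shortened.

```python
import random
import textwrap

# Shortened phrase lists for illustration; substitute the full lists.
PHRASES = {
    "leadin": ["On the other hand,", "This suggests that", "It appears that"],
    "subject": ["the notion of level of grammaticalness",
                "most of the methodological work in modern linguistics"],
    "verb": ["delimits", "suffices to account for", "is not subject to"],
    "object": ["the traditional practice of grammarians.", "a descriptive fact."],
}

def random_sentence(rng=random):
    """Build one sentence by picking one phrase per slot, independently."""
    return " ".join(rng.choice(PHRASES[slot])
                    for slot in ("leadin", "subject", "verb", "object"))

def random_paragraph(sentences=8, line_length=72, rng=random):
    """Join `sentences` random sentences and wrap them to `line_length` columns."""
    text = " ".join(random_sentence(rng) for _ in range(sentences))
    return textwrap.fill(text, line_length)

print(random_paragraph(sentences=8))
```

Passing a `random.Random(seed)` instance as `rng` should make the output repeatable, which can help when a failing test needs to be reproduced.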

And for a title, of course, you can do:

chomsky().split('\n')[0]
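For the bulk test data described in the question (over a million articles of varying size), one possible approach is to wrap a sentence source in a lazy generator function so that the articles never have to sit in memory all at once. The sketch below is an assumption about how that could look, not part of the original answer: `generate_articles`, its `sentence_fn` parameter, the size ranges, and the stand-in `demo_sentence` are all hypothetical. Any function that returns one sentence, such as one built from the phrase lists above, could be passed in instead.

```python
import random
import textwrap

def generate_articles(count, sentence_fn, seed=None,
                      paragraphs=(1, 6), sentences=(2, 8), line_length=72):
    """Lazily yield (title, body) pairs with randomly varied sizes.

    sentence_fn(rng) must return one sentence as a string. The (min, max)
    ranges are arbitrary defaults chosen for illustration.
    """
    rng = random.Random(seed)  # a fixed seed makes a test run reproducible
    for _article in range(count):
        body = []
        for _para in range(rng.randint(*paragraphs)):
            text = " ".join(sentence_fn(rng)
                            for _ in range(rng.randint(*sentences)))
            body.append(textwrap.fill(text, line_length))
        # Same idea as chomsky().split('\n')[0]: first wrapped line as the title.
        title = textwrap.fill(sentence_fn(rng), line_length).split("\n")[0]
        yield title, "\n\n".join(body)

def demo_sentence(rng):
    """Stand-in sentence source; swap in a phrase-based generator."""
    words = ["grammar", "corpus", "level", "token", "analysis", "principle"]
    picked = " ".join(rng.choice(words) for _ in range(rng.randint(4, 12)))
    return picked.capitalize() + "."

for title, body in generate_articles(3, demo_sentence, seed=42):
    print(title)
    print(body)
    print("-" * 20)
```

Because the function is a generator, each article can be written to the system under test as it is produced, and varying `paragraphs` and `sentences` gives a spread of layout sizes.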