Python: splitting text into units and creating lists of those units

Asked: 2014-02-05 05:07:38

Tags: regex list python-2.7 split

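I need to write a Python script that reads a short text file, splits the text into units (paragraphs, sentences, words), and then prints the text as lists of those units. I would also like to be able to print subsets of those lists (for example the first 15 elements, elements 8-10, and so on).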
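Subsets like these correspond to list slices in Python. A minimal sketch, assuming a hypothetical list named `units` that stands in for whatever list of paragraphs, sentences, or words the script produces:

```python
# Hypothetical list of units, standing in for words read from a file.
units = ["w%d" % i for i in range(1, 21)]  # "w1" through "w20"

# The first 15 elements: slice from the start up to (not including) index 15.
first_15 = units[:15]

# "Elements 8-10", counting from 1, sit at indices 7, 8 and 9.
# The end index of a slice is exclusive, so the slice is [7:10].
elements_8_to_10 = units[7:10]

print(first_15)
print(elements_8_to_10)
```

Indices start at 0, so counting "element 8" from 1 means index 7. A slice that runs past the end of a short list returns whatever elements exist instead of raising an error.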

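I'm new to programming. I've heard that this would be easier with NLTK, but I'd like to figure out how to do it without NLTK. Here is what I have so far for splitting a paragraph into sentences... but I really don't know where to go from here, or how to end up with lists of all the units.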

Splitting a paragraph into sentences:

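import re

def para_to_sent(paragraph):
    # Split wherever '.', '!' or '?' occurs. The punctuation itself is
    # discarded, and a trailing empty string is returned when the
    # paragraph ends with one of these characters.
    sentence_end = re.compile(r'[.!?]')
    sentence_list = sentence_end.split(paragraph)
    return sentence_list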

if __name__ == '__main__':
    p = """Big Brother watching Big Brother is so meta. According to Deputy Attorney General James Cole, the National Security Agency "probably" gathers phone records of Congressional lawmakers and staff. Cole was grilled by members of Congress at a House Judiciary Committee hearing on Tuesday, and said that he was unaware of anything that "scrubbed out" Congressional numbers from the NSA's data sweeps. But don't worry, Cole added that NSA officials aren't allowed to look at the findings "unless we have reasonable, articulable suspicion that those numbers are related to a known terrorist threat."""
    sentences = para_to_sent(p)
    for s in sentences:
        print s.strip()
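One possible way to extend this to all three kinds of unit, using only the standard `re` module, is sketched below. The function names, the sample text, and the splitting rules are assumptions made for illustration: paragraphs are taken to be separated by blank lines, a sentence is taken to end at `.`, `!` or `?` followed by whitespace, and a word is taken to be a run of letters, digits, or apostrophes. Rules this naive will mis-split abbreviations such as "Mr. Smith".

```python
import re

def text_to_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

def paragraph_to_sentences(paragraph):
    # Split on whitespace that follows '.', '!' or '?'. The lookbehind
    # keeps the punctuation attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    return [s for s in parts if s]

def sentence_to_words(sentence):
    # Assumption: a word is a run of letters, digits, or apostrophes;
    # all other punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

# Illustrative sample text (not taken from any real file).
sample = "Big Brother is watching. Is that so meta?\n\nDon't worry! It's fine."

paragraphs = text_to_paragraphs(sample)
sentences = [s for p in paragraphs for s in paragraph_to_sentences(p)]
words = [w for s in sentences for w in sentence_to_words(s)]

print(paragraphs)
print(sentences)
print(words[:5])  # a subset of the word list
```

To work on a file rather than a literal string, the text could be read with something like `open(filename).read()`, where `filename` is a placeholder for the path to the short text file.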

If this is a dumb question, please let me know! Even just pointing me toward some solid resources would be very helpful!

0 answers:

No answers yet.