Parsing a paragraph: detecting sentences that have no punctuation

Time: 2013-09-14 02:43:05

Tags: python text nltk

Suppose I have the following text:

  

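Steps toward this goal include: Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.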

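Reading the text, you can tell that it contains several sentences (a bulleted list whose items were run together). How can I split this text into sentences? I tried Python's NLTK, but with no luck. Checking for capital letters does not work either, because it is not reliable enough.

Any ideas on how to solve this?

Thanks.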

1 answer:

Answer 0: (score: 1)

If I understand correctly, this small piece of code may help (note: tested on Python 2.7.5):

paragraph = 'Steps toward this goal include: Increasing efficiency of mobile networks, data centers, data transmission, and spectrum allocation Reducing the amount of data apps have to pull from networks through caching, compression, and futuristic technologies like peer-to-peer data transfer Making investments in accessibility profitable by educating people about the uses of data, creating business models that thrive when free data access is offered initially, and building out credit card infrastructure so carriers can move from pre-paid to post-paid models that facilitate investment If the plan works, mobile operators will gain more customers and invest more in accessibility; phone makers will see people wanting better devices; Internet providers will get to connect more people; and people will receive affordable Internet so they can join the knowledge economy and connect with the people they care about.'
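words = []
separators = ['.', ',', ':', ';']
oldValue = 0  # index where the current chunk starts
for value in range(len(paragraph)):
    # Close the current chunk whenever a separator character is reached.
    if paragraph[value] in separators:
        # strip() drops the space that follows the previous separator.
        words.append(paragraph[oldValue:value+1].strip())
        oldValue = value + 1
for word in words:
    print(word)  # the parenthesized form runs on both Python 2.7 and Python 3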
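
As a variation on the loop above, here is a minimal sketch that does the same separator-based split with a regular expression. The function name `split_on_separators` and the sample string are made up for illustration. Unlike the loop, it also keeps trailing text that does not end in a separator.

```python
import re

def split_on_separators(text, separators='.,:;'):
    """Split text into chunks, each ending at one of the separator characters."""
    sep_class = re.escape(separators)
    # A chunk is a run of non-separator characters plus the separator
    # that ends it, if there is one.
    pattern = '[^{0}]+[{0}]?'.format(sep_class)
    chunks = (chunk.strip() for chunk in re.findall(pattern, text))
    # Drop chunks that were only whitespace.
    return [chunk for chunk in chunks if chunk]

# Hypothetical sample, shortened from the paragraph in the question.
sample = ('If the plan works, operators will gain customers; '
          'people will receive affordable Internet')
for chunk in split_on_separators(sample):
    print(chunk)
```

This only splits where punctuation exists, so the run-together list items in the question still come out as one long chunk.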

[Edit] You can also easily add splitting on capital letters with:
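# isupper() is True only for uppercase letters; comparing against .upper()
# would also match spaces, digits, and punctuation.
if paragraph[value].isupper():
    # End the chunk just before the capital letter.
    words.append(paragraph[oldValue:value])
    ...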
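
To make the capital-letter idea concrete, here is a rough sketch. It assumes that a new list item starts wherever a capitalized word directly follows a lowercase letter with no punctuation in between. The function name `split_before_capitals` and the sample strings are hypothetical. As the question points out, this heuristic is not reliable: a capitalized word in mid-sentence, such as "Internet", also triggers a split.

```python
import re

def split_before_capitals(text):
    """Split where whitespace sits between a lowercase letter and a capital.

    This is a heuristic, not a real sentence detector: it assumes that a
    capital letter with no punctuation before it starts a new item.
    """
    return re.split(r'(?<=[a-z])\s+(?=[A-Z])', text)

# Hypothetical sample: three list items run together without punctuation.
items = split_before_capitals(
    'Increasing efficiency of networks Reducing the amount of data '
    'Making investments profitable')
for item in items:
    print(item)

# A proper noun in mid-sentence produces a false split.
false_split = split_before_capitals('people will receive affordable Internet access')
print(false_split)
```

A capital that follows punctuation, as in "include: Increasing", is left alone here because the separator-based split already handles it.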