Python - Grouping words into groups of 3

Time: 2015-01-25 06:06:37

Tags: python multidimensional-array

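I'm trying to create a multidimensional array that contains, for each word in a string: the word before it (blank if the word is at the start of the string), the word itself, and the word after it (blank if the word is at the end of the string).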

I tried the following code:

def parse_group_words(text):
    groups = []
    words = re_sub("[^\w]", " ",  text).split()
    number_words = len(words)
    for i in xrange(number_words):
        print i
        if i == 0:
            groups[i][0] = ""
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i > 0 and i != number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i == number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = ""            
    print groups

print parse_group_words("this is an example of text are you ready")

But I get:

0

Traceback (most recent call last):
  File "/home/akf/program.py", line 82, in <module>
    print parse_group_words("this is an example of text are you ready")
  File "/home/akf/program.py", line 69, in parse_group_words
    groups[i][0] = ""
IndexError: list index out of range

Any idea how to fix this?

2 answers:

Answer 0: (score: 1)

Here is a general way to do this for a window of arbitrary size, using Python's collections and itertools:

import re
import collections
import itertools

def window(seq, n=3):
    d = collections.deque(maxlen=n)
    for x in itertools.chain(('', ), seq, ('', )):
        d.append(x)
        if len(d) >= n:
            yield tuple(d)

def windows(text, n=3):
    return list(window((x.group() for x in re.finditer(r'\w+', text)), n=n))
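
For the fixed window of three in the question, the same padding idea can be sketched with plain list slicing. This also sidesteps the IndexError from the question, which arises because groups starts out empty, so groups[i] does not exist yet when groups[i][0] is assigned. The name group_words is hypothetical, and the sketch assumes that splitting on \w+ matches what the question intends:

```python
import re

def group_words(text):
    # Hypothetical helper for illustration, not part of the original answer.
    words = re.findall(r'\w+', text)
    # Pad both ends so the first and last words get '' as a neighbour.
    padded = [''] + words + ['']
    # One [previous, current, next] slice per word.
    return [padded[i:i + 3] for i in range(len(words))]

print(group_words("this is an example"))
# [['', 'this', 'is'], ['this', 'is', 'an'], ['is', 'an', 'example'], ['an', 'example', '']]
```

Building each group as a slice means nothing is ever assigned into an index that has not been created.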

Answer 1: (score: 0)

How about...:

import itertools, re

def parse_group_words(text):
    groups = []
    words = re.finditer(r'\w+', text)
    prv, cur, nxt = itertools.tee(words, 3)
    next(cur); next(nxt); next(nxt)
    for previous, current, thenext in itertools.izip(prv, cur, nxt):
        # in Py 3, use `zip` in lieu of itertools.izip
        groups.append([previous.group(0), current.group(0), thenext.group(0)])
    print(groups)

parse_group_words('tanto va la gatta al lardo che ci lascia')

This is almost what you need - it emits:

[['tanto', 'va', 'la'], ['va', 'la', 'gatta'], ['la', 'gatta', 'al'], ['gatta', 'al', 'lardo'], ['al', 'lardo', 'che'], ['lardo', 'che', 'ci'], ['che', 'ci', 'lascia']]

...which is missing the last required group, ['ci', 'lascia', ''].

To fix it, just before the print, you can add:

groups.append([groups[-1][1], groups[-1][2], ''])

This feels like a moderately nasty hack - I can't easily find an elegant way to have this last group "just appear" from the general logic of the rest of the function.
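
One possible way to make the edge groups fall out of the general logic is to pad the word stream with '' at both ends before calling tee. Note that the output shown above also lacks the leading group ['', 'tanto', 'va'] that the question asks for; the padding supplies that one too. This is only a sketch, assuming Python 3 (so zip rather than itertools.izip), and the name parse_group_words_padded is hypothetical:

```python
import itertools
import re

def parse_group_words_padded(text):
    # Hypothetical variant for illustration, not the answer's own code.
    words = (m.group(0) for m in re.finditer(r'\w+', text))
    # Pad the stream with '' on both sides so the edge groups need no special case.
    padded = itertools.chain([''], words, [''])
    prv, cur, nxt = itertools.tee(padded, 3)
    # Advance cur by one and nxt by two; the None default avoids
    # StopIteration when the text contains no words.
    next(cur, None)
    next(nxt, None)
    next(nxt, None)
    return [list(group) for group in zip(prv, cur, nxt)]

print(parse_group_words_padded('tanto va la gatta'))
# [['', 'tanto', 'va'], ['tanto', 'va', 'la'], ['va', 'la', 'gatta'], ['la', 'gatta', '']]
```

With the padding in place, zip stops exactly when nxt reaches the trailing '', so no group has to be appended afterwards.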