Question

I have opened a file and used readlines() and split() with the regex '\t' to remove the tabs, which gave the following lists:

["1", "cats", "--,"]
["2", "chase", "--,"]
["3", "dogs", "--,"]
["1", "the", "--,"]
["2", "car", "--,"]
["3", "is", "--,"]
["4", "gray", "--,"]

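Now I want to extract and slice these into sublists such as "cats chase dogs" and "the car is gray", looping with the integer at index [0] as the sentence boundary. For example, rows 1-3 go into the sublist "cats chase dogs", then the count restarts and rows 1-4 go into the sublist "the car is gray", and so on for the remaining lists, so that I get sublists like ["the", "car", "is", "gray"]. How can I do this?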

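I tried the code below, but I got this error:

cannot concatenate int + str

The "i" in the for loop is treated as a string element rather than an integer: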

with open(buffer, 'r') as f:
    words = []
    for line in f:
        items = line.split('\t')[:1]
        for i in items:
            while i>1:
                i = i+1
                print i

Answer 1

Something like this:

from itertools import groupby

with open('yourfile') as fin:
    # split lines
    lines = (line.split() for line in fin)
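    # group by consecutive ints: (position - counter) is constant within a sentence
    # (index-based lambda, since tuple-unpacking parameters are Python 2 only)
    grouped = groupby(enumerate(lines), lambda pair: pair[0] - int(pair[1][0]))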
    # build sentences from words in groups
    sentences = [' '.join(el[1][1] for el in g) for k, g in grouped]
    # ['cats chase dogs', 'the car is gray']

Note: this is based on your example data:

example = [
    ["1", "cats", "--,"],
    ["2", "chase", "--,"],
    ["3", "dogs", "--,"],
    ["1", "the", "--,"],
    ["2", "car", "--,"],
    ["3", "is", "--,"],
    ["4", "gray", "--,"]
]
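To see why this grouping key works, it may help to print it for the example data. The following is only a minimal sketch, written to run on Python 3 as well; the function name `group_sentences` is made up for illustration, and it returns lists of words rather than joined strings. It assumes the first field of every row is a counter that restarts at 1 for each sentence and increases by 1 per row.

```python
from itertools import groupby

example = [
    ["1", "cats", "--,"],
    ["2", "chase", "--,"],
    ["3", "dogs", "--,"],
    ["1", "the", "--,"],
    ["2", "car", "--,"],
    ["3", "is", "--,"],
    ["4", "gray", "--,"],
]


def group_sentences(rows):
    """Group rows into word lists, starting a new list when the counter restarts.

    Hypothetical helper: assumes row[0] is a numeric string and row[1] is the word.
    """
    # Within one sentence both the position and the counter grow by 1 per row,
    # so (position - counter) stays constant and changes only at a restart.
    keyed = groupby(enumerate(rows), key=lambda pair: pair[0] - int(pair[1][0]))
    return [[row[1] for _, row in group] for _, group in keyed]


# The grouping key for each row: one constant value per sentence.
keys = [idx - int(row[0]) for idx, row in enumerate(example)]
sentences = group_sentences(example)

print(keys)       # [-1, -1, -1, 2, 2, 2, 2]
print(sentences)  # [['cats', 'chase', 'dogs'], ['the', 'car', 'is', 'gray']]
```

Joining each inner list with ' '.join(...) gives the sentence strings instead of word lists.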

Answer 2

Choosing an appropriate data structure makes the job easier:

container = [["1", "cats", "--,"],
             ["2", "chase", "--,"],
             ["3", "dogs", "--,"],
             ["1", "the", "--,"],
             ["2", "car", "--,"],
             ["3", "is", "--,"],
             ["4", "gray", "--,"]]

Nest the lists inside a container list, then use a dictionary to store the output lists:

from collections import defaultdict

out = defaultdict(list)              # Initialize dictionary for output
key = 0                              # Initialize key  

for idx, word, _ in container:       # Unpack sublists
    if int(idx) == 1:                # Check if we are at start of new sentence
        key += 1                     # Increment key for new sentence
    out[key].append(word)            # Add word to list

This gives:

{
    1: ['cats', 'chase', 'dogs'], 
    2: ['the', 'car', 'is', 'gray']
}
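The same idea can be applied directly to the tab-separated lines, without building the container first. This is a rough sketch rather than a definitive version: the function name `split_sentences` and the `sample` lines are assumptions for illustration, and it collects plain lists instead of a dictionary. Converting the first field with int() before comparing it is what avoids the int/str error from the question.

```python
def split_sentences(lines):
    """Collect the second field of each tab-separated line into per-sentence lists.

    Hypothetical helper: assumes the first field is a counter that restarts
    at 1 at the beginning of every sentence.
    """
    sentences = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        counter = int(fields[0])  # fields are strings, so convert before comparing
        if counter == 1 or not sentences:
            sentences.append([])  # counter restarted: begin a new sentence
        sentences[-1].append(fields[1])
    return sentences


# Stand-in for the lines of the real file (an open file object works the same way).
sample = [
    "1\tcats\t--,\n",
    "2\tchase\t--,\n",
    "3\tdogs\t--,\n",
    "1\tthe\t--,\n",
    "2\tcar\t--,\n",
    "3\tis\t--,\n",
    "4\tgray\t--,\n",
]

result = split_sentences(sample)
print(result)  # [['cats', 'chase', 'dogs'], ['the', 'car', 'is', 'gray']]
```

With a real file, passing the open file object (for example `split_sentences(f)` inside a `with open(...) as f:` block) should behave the same way, since iterating a file yields its lines.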

How to slice a numbered list into sublists
