I have opened a file and used readlines() and split() with the regex '\t' to remove the tabs, which gives the following:
["1", "cats", "--,"]
["2", "chase", "--,"]
["3", "dogs", "--,"]
["1", "the", "--,"]
["2", "car", "--,"]
["3", "is", "--,"]
["4", "gray", "--,"]
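Now I want to extract and slice these rows into sublists such as "cats chase dogs" and "the car is gray", looping with the integer at index [0] as the sentence boundary. For example, rows 1-3 form the sublist "cats chase dogs", then the count restarts and rows 1-4 form the sublist "the car is gray", and so on for the remaining rows, so that I end up with sublists like ["the", "car", "is", "gray"]. How can I do this?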
I tried the following, but I got the error:
cannot concatenate int + str
The "i" in the for loop is treated as a string element rather than an integer:
with open(buffer, 'r') as f:
    words = []
    for line in f:
        items = line.split('\t')[:1]
        for i in items:
            while i>1:
                i = i+1
                print i
Answer 0 (score: 2)
Something like this:
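from itertools import groupby

with open('yourfile') as fin:
    # split lines
    lines = (line.split() for line in fin)
    # group by consecutive ints: position minus index is constant within a run
    # (the lambda takes the (position, row) pair as a single argument, since
    # tuple-unpacking parameters are not valid in Python 3)
    grouped = groupby(enumerate(lines), lambda pair: pair[0] - int(pair[1][0]))
    # build sentences from words in groups
    sentences = [' '.join(el[1][1] for el in g) for k, g in grouped]
    # ['cats chase dogs', 'the car is gray']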
Note: this is based on your example data:
example = [
    ["1", "cats", "--,"],
    ["2", "chase", "--,"],
    ["3", "dogs", "--,"],
    ["1", "the", "--,"],
    ["2", "car", "--,"],
    ["3", "is", "--,"],
    ["4", "gray", "--,"]
]
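To try the grouping without a file, the same idea can be run directly on the in-memory example list. This is only a minimal sketch, assuming the index column always restarts at 1 and increases by 1 within a sentence; the helper name `split_sentences` is hypothetical and not part of the original answer.

```python
from itertools import groupby


def split_sentences(rows):
    """Group rows into lists of words wherever the running index restarts.

    `split_sentences` is an illustrative name, not from the original answer.
    """
    # position - index stays constant while the index counts up by one,
    # so every run of consecutive indices shares a single groupby key.
    keyed = groupby(enumerate(rows), lambda pair: pair[0] - int(pair[1][0]))
    return [[row[1] for _, row in group] for _, group in keyed]


rows = [
    ["1", "cats", "--,"],
    ["2", "chase", "--,"],
    ["3", "dogs", "--,"],
    ["1", "the", "--,"],
    ["2", "car", "--,"],
    ["3", "is", "--,"],
    ["4", "gray", "--,"],
]

sentences = split_sentences(rows)
print(sentences)
```

Returning lists of words rather than joined strings matches the sublist form the question asks for; `' '.join(words)` turns each one into a sentence string if needed.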
Answer 1 (score: 0)
Choosing the right data structure makes the job easier:
container = [["1", "cats", "--,"],
             ["2", "chase", "--,"],
             ["3", "dogs", "--,"],
             ["1", "the", "--,"],
             ["2", "car", "--,"],
             ["3", "is", "--,"],
             ["4", "gray", "--,"]]
Nest the lists inside a container list, then use a dictionary to store the output lists:
from collections import defaultdict

out = defaultdict(list)             # Initialize dictionary for output
key = 0                             # Initialize key
for idx, word, _ in container:      # Unpack sublists
    if int(idx) == 1:               # Check if we are at start of new sentence
        key += 1                    # Increment key for new sentence
    out[key].append(word)           # Add word to list
Which gives:
{
1: ['cats', 'chase', 'dogs'],
2: ['the', 'car', 'is', 'gray']
}
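If the sentences are wanted as plain sublists or strings rather than a dictionary, the values can be read back in key order. The following is a sketch under the same assumptions as the loop above (every sentence starts with index 1); the names `sublists` and `joined` are illustrative only, and the loop is repeated so the snippet runs on its own.

```python
from collections import defaultdict

container = [["1", "cats", "--,"],
             ["2", "chase", "--,"],
             ["3", "dogs", "--,"],
             ["1", "the", "--,"],
             ["2", "car", "--,"],
             ["3", "is", "--,"],
             ["4", "gray", "--,"]]

out = defaultdict(list)
key = 0
for idx, word, _ in container:
    if int(idx) == 1:       # index 1 marks the start of a new sentence
        key += 1
    out[key].append(word)

# Read the groups back in sentence order (keys are 1, 2, ...).
sublists = [out[k] for k in sorted(out)]
# Optionally join each sublist into one sentence string.
joined = [" ".join(words) for words in sublists]

print(sublists)
print(joined)
```

Sorting the keys keeps the sentence order explicit, so the result does not rely on dictionary ordering.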