I have a text file with the following structure:
name1:
sentence. [sentence. ...] # can be one or more
name2:
sentence. [sentence. ...]
Edit: input example:
Djohn:
Hello. I am Djohn
I am Djohn.
Bot:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id
est laborum.
Ninja:
Hey guys!! wozzup
Edit 2: input example:
This is example sentence that can come before first speaker.
Djohn:
Hello. I am Djohn
I am Djohn.
Bot:
Yes, I understand, don't say it twice lol
Ninja:
Hey guys!! wozzup
Each item (a name or a sentence) is a Unicode string. I put this data into a list and want to build a dictionary:
{
'name1': [[sentence.], ..]
'name2': [[sentence.], ..]
}
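Edit 3:
The dictionary I am building is meant to be written to a file, and it consists entirely of Unicode strings.
What I am trying to do is something like this: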
for i, paragraph in enumerate(paragraphs):  # paragraphs is the list
                                            # with Unicode strings
    if isParagraphEndsWithColon(paragraph):
        name = paragraph
        text = []
        for p in range(paragraphs[i], paragraphs[-1]):
            if isParagraphEndsWithColon(p):
                break
            localtext.extend(p)
        # this is output dictionary I am trying to build
        outputDocumentData[name].extend(text)
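That is, I need a nested loop that runs from the "name:" line I just found up to the next one, extending the list of sentences stored under the same key (the name). The problem is that range() does not work for me here, because it expects integers.
I am looking for a "pythonic" way to run a nested loop from the current element to the end of the list. (It feels like slicing the list on every iteration would be inefficient.)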
Answer 0 (score: 3)
You can use groupby:
from itertools import groupby
lines = ["Djohn:",
         "Hello. I am Djohn",
         "I am Djohn.",
         "Bot:",
         "Yes, I understand, don't say it twice lol",
         "Ninja:",
         "Hey guys!! wozzup"]
name = ''
result = {}
# groupby yields alternating runs: name lines (k is True), then sentence lines (k is False)
for k, v in groupby(lines, key=lambda x: x.endswith(':')):
    if k:
        # remember the current speaker, dropping the trailing colon
        name = ''.join(v).rstrip(':')
    else:
        result.setdefault(name, []).extend(v)
print(result)
Output (keys in insertion order on Python 3.7+):
{'Djohn': ['Hello. I am Djohn', 'I am Djohn.'], 'Bot': ["Yes, I understand, don't say it twice lol"], 'Ninja': ['Hey guys!! wozzup']}
The idea is to split the input into alternating runs of name lines and non-name lines, which is why lambda x: x.endswith(':') can serve as the key.
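To make the grouping concrete, here is a minimal sketch that only prints the runs groupby produces for that key, using a shortened version of the second input example from the question. The names runs and is_name are introduced here for illustration and are not part of the answer's code.

```python
from itertools import groupby

lines = ["This is example sentence that can come before first speaker.",
         "Djohn:",
         "Hello. I am Djohn",
         "I am Djohn.",
         "Bot:",
         "Yes, I understand, don't say it twice lol"]

# Materialize each run right away: the group iterators that groupby hands out
# share the underlying iterator and are no longer usable once it advances.
runs = [(is_name, list(group))
        for is_name, group in groupby(lines, key=lambda x: x.endswith(':'))]

for is_name, group in runs:
    print(is_name, group)
```

The first run is a non-name run, since the opening sentence does not end with a colon. With a loop like the one in the answer, where name starts as an empty string, such leading lines would end up under the '' key.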