I have a list of sentences, such as:
Sentence 1.
And Sentence 2.
Or Sentence 3.
New Sentence 4.
New Sentence 5.
And Sentence 6.
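I am trying to group these sentences by a "conjunction criterion": if the next sentence starts with a conjunction (currently only "And" or "Or"), it should be placed in the same group as the sentence before it, like this: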
Group 1:
Sentence 1.
And Sentence 2.
Or Sentence 3.
Group 2:
New Sentence 4.
Group 3:
New Sentence 5.
And Sentence 6.
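I wrote the following code, which detects some runs of consecutive sentences but not all of them.
How can I code this recursively? I tried an iterative approach, but it fails in some cases, and I cannot figure out how to write it recursively.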
import nltk

tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.","New Sentence 4.","New Sentence 5.","And Sentence 6."]
# Not shown in the original post; assumed to hold the lowercase conjunctions.
conjucture_list = ["and", "or"]
already_selected = []
attachlist = {}
for i in tokens:
    attachlist[i] = []
for i in range(len(tokens)):
    if i in already_selected:
        pass
    else:
        for j in range(i+1, len(tokens)):
            if j not in already_selected:
                first_word = nltk.tokenize.word_tokenize(tokens[j].lower())[0]
                if first_word in conjucture_list:
                    attachlist[tokens[i]].append(tokens[j])
                    already_selected.append(j)
                else:
                    break
Answer 0 (score: 3)
tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.",
          "New Sentence 4.","New Sentence 5.","And Sentence 6."]
result = list()
for token in tokens:
    # The trailing space in the prefixes avoids matching words like "Andy ..." and "Orwell ...".
    # "not result" guards against a conjunction at the start of the very first sentence.
    if not result or not token.startswith(("And ", "Or ")):
        result.append([token])
    else:
        result[-1].append(token)
Result:
[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'],
['New Sentence 4.'],
['New Sentence 5.', 'And Sentence 6.']]
Answer 1 (score: 0)
I have a soft spot for nested iterators and generic code, so here is a very generic approach:
import re

class split_by:
    # Lazily splits an iterable into sections; a new section starts
    # at every item for which predicate(item) is true.
    def __init__(self, iterable, predicate=None):
        self.iter = iter(iterable)
        self.predicate = predicate or bool
        try:
            self.head = next(self.iter)
        except StopIteration:
            self.finished = True
        else:
            self.finished = False

    def __iter__(self):
        return self

    def _section(self):
        yield self.head
        for self.head in self.iter:
            if self.predicate(self.head):
                break
            yield self.head
        else:
            self.finished = True

    def __next__(self):
        if self.finished:
            raise StopIteration
        section = self._section()
        return section

# Each section must be consumed (here with list) before the next one is requested.
# \b keeps words such as "Andy" or "Orwell" from counting as conjunctions.
[list(x) for x in split_by(tokens, lambda sentence: not re.match(r"(?i)(?:and|or)\b", sentence))]
#>>> [['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]
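It is longer, but it needs only O(1) extra space (as long as each section is consumed before the next one is requested) and it accepts a predicate of your choice.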
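Answer 2 (score: 0)
This problem is better solved iteratively than recursively, because the output needs only one level of grouping. If you are looking for a recursive solution, please give an example with arbitrary levels of grouping.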
def is_conjunction(sentence):
    return sentence.startswith('And') or sentence.startswith('Or')

tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.",
          "New Sentence 4.","New Sentence 5.","And Sentence 6."]

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result #flush the last group
            result = []
        result.append(s)
    if result:
        yield result #flush the rest of the result buffer

>>> groups = group_sentences_by_conjunction(tokens)
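Using yield is preferable when the result might not fit in memory, for example when reading all the sentences of a book stored in a file.
If for some reason you need the result as a list, use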
>>> groups_list = list(groups)
Result:
[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]
If you need group numbers, use enumerate(groups).
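As a minimal sketch of the enumerate approach, the snippet below prints the groups in the numbered layout from the question. The generator is repeated so that the snippet runs on its own, and format_groups is a hypothetical helper name that does not appear in the original answer.

```python
def is_conjunction(sentence):
    # Same prefix test as in this answer.
    return sentence.startswith('And') or sentence.startswith('Or')

def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result  # flush the finished group
            result = []
        result.append(s)
    if result:
        yield result  # flush the last group

def format_groups(groups):
    # Hypothetical helper: numbers the groups starting at 1.
    lines = []
    for number, group in enumerate(groups, start=1):
        lines.append(f"Group {number}:")
        lines.extend(group)
    return lines

tokens = ["Sentence 1.", "And Sentence 2.", "Or Sentence 3.",
          "New Sentence 4.", "New Sentence 5.", "And Sentence 6."]

print("\n".join(format_groups(group_sentences_by_conjunction(tokens))))
```

Passing start=1 to enumerate avoids adding 1 to a zero-based index by hand.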
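is_conjunction suffers from the same problem mentioned in the other answers: it also matches words such as "Andy" or "Orwell". Modify it as needed to fit your criteria.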
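One possible modification, sketched here as an assumption rather than as part of the original answer, is to test for a whole leading word with a regular expression. The name CONJUNCTION_RE is hypothetical, and case-insensitive matching is a choice that may or may not fit your data.

```python
import re

# Hypothetical pattern: optional leading whitespace, then "and" or "or"
# as a whole word (\b), in any letter case.
CONJUNCTION_RE = re.compile(r"\s*(?:and|or)\b", re.IGNORECASE)

def is_conjunction(sentence):
    # re.match anchors at the start of the string, so only the first word counts.
    return CONJUNCTION_RE.match(sentence) is not None

print(is_conjunction("And Sentence 2."))     # True
print(is_conjunction("Andy went home."))     # False
print(is_conjunction("Orwell wrote 1984."))  # False
```

Extending the list of conjunctions then only requires adding alternatives inside the non-capturing group.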