So I have two lists that contain the same text, but split up differently:
sentences = ["This is a sentence", "so is this"]
phrases = ["This is", "a sentence so", "is this"]
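What I want to do is check whether any element of the `phrases` list is not fully represented by a single element of `sentences`, and then split that `phrases` element accordingly. For example, in this case: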
"a sentence so"
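this part of `phrases` is spread across the first and second elements of `sentences`, so it should be split between "a sentence" and "so", creating a new element.
"This is" and "is this" in `phrases` should effectively be ignored, because each of them corresponds entirely to a single element of `sentences`. After this, suppose I want to count how many elements each list has: the result for `sentences` should still be 2, but `phrases` should go from 3 to 4.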
Sentencecount = 0
Phrasecount = 0
for i in sentences:
    Sentencecount += 1
for n in phrases:
    # code here should check each element with 'sentences' elements and split accordingly
    Phrasecount += 1
# expected result: phrases = ["This is", "a sentence", "so", "is this"]
Answer 0 (score: 4)
Well, this turned out to be harder, and more interesting, than I expected.
from collections import deque
def align_wordlists(words1, words2):
    # Split every element of the word lists
    # >>> [e.split(" ") for e in ["This is", "a sentence"]]
    # [["This", "is"], ["a", "sentence"]]
    words1_split = [e.split(" ") for e in words1]
    words2_split = [e.split(" ") for e in words2]
    # Assert that the flattened lists are identical
    assert [word for split in words1_split for word in split] == \
        [word for split in words2_split for word in split]
    # Create a queue and two tracking lists
    Q = deque(enumerate(words2_split))
    result = []
    splits = []
    # Keep track of the current sublist in words1
    words1_sublist_id = 0
    words1_sublist_offset = 0
    # Keep iterating until the queue is empty
    while Q:
        sublist_id, sublist = Q.popleft()
        sublist_len = len(sublist)
        words1_sublist_len = len(words1_split[words1_sublist_id])
        words1_remaining_len = words1_sublist_len - words1_sublist_offset
        if sublist_len <= words1_remaining_len:
            # The sublist fits entirely into the current segment in words 1,
            # add sublist untouched to resulting list.
            result.append(" ".join(sublist))
            # Update the sublist tracking
            if (words1_sublist_len - words1_sublist_offset - sublist_len) == 0:
                # The sublist filled the remaining space
                words1_sublist_id += 1
                words1_sublist_offset = 0
            else:
                # The sublist only filled part of the remaining space
                words1_sublist_offset += sublist_len
        else:
            # Only part of the current sublist fits.
            # Split the segment at the point where the left
            # part fits into the current segment of words1.
            # Then add the remaining right list to the front
            # of the queue.
            left = " ".join(sublist[:words1_remaining_len])
            right = sublist[words1_remaining_len:]
            result.append(left)
            Q.appendleft((sublist_id, right))
            # Keep track of splits
            splits.append(sublist_id)
            # update indices
            words1_sublist_id += 1
            words1_sublist_offset = 0
    # Combine splits into sublists to get desired result
    for split in splits:
        if isinstance(result[split], str):
            result[split:split+2] = [[result[split], result[split + 1]]]
        else:
            result[split] = result[split] + [result[split + 1]]
            del result[split + 1]
    return result
>>> words1 = ["This is a sentence", "so is this"]
>>> words2 = ["This is", "a sentence so", "is this"]
>>> align_wordlists(words1, words2)
['This is', ['a sentence', 'so'], 'is this']
>>> words1 = ["This is a longer", "sentence with", "different splits"]
>>> words2 = ["This is", "a longer sentence", "with different splits"]
>>> align_wordlists(words1, words2)
['This is', ['a longer', 'sentence'], ['with', 'different splits']]
>>> words1 = ["This is a longer", "sentence with", "different splits"]
>>> words2 = ["This is", "a longer sentence with different splits"]
>>> align_wordlists(words1, words2)
['This is', ['a longer', 'sentence with', 'different splits']]
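The question asked for a flat list and an element count, whereas the outputs above nest the pieces of every split phrase. A minimal sketch of flattening such a result, assuming at most one level of nesting as in the examples above (`flatten_aligned` is a hypothetical helper name, not part of the answer):

```python
def flatten_aligned(aligned):
    """Flatten one level of nesting: strings stay, sublists are expanded."""
    flat = []
    for item in aligned:
        if isinstance(item, list):
            flat.extend(item)   # pieces of a phrase that was split
        else:
            flat.append(item)   # phrase that fit into a single sentence
    return flat


# Nested output copied from the first example above
nested = ['This is', ['a sentence', 'so'], 'is this']
flat = flatten_aligned(nested)
print(flat)       # ['This is', 'a sentence', 'so', 'is this']
print(len(flat))  # 4
```

`len(flat)` then gives the phrase count of 4 that the question expects.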
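A high-level description of the algorithm used here. The problem you describe boils down to this question:
For each phrase in the second word list, which sentence in the first list does it belong to?
To answer this question, the algorithm above takes several steps:
Split the word groups in `words1` and `words2` into sublists. We do this right at the start because it makes it easier to handle the individual words of a phrase later on.
def align_wordlists(words1, words2):
    # Split every element of the word lists
    # >>> [e.split(" ") for e in ["This is", "a sentence"]]
    # [["This", "is"], ["a", "sentence"]]
    words1_split = [e.split(" ") for e in words1]
    words2_split = [e.split(" ") for e in words2]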
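To make sure the algorithm can work at all, I added an assertion that verifies the two sentences (i.e. the word lists) are exactly the same once we ignore every split and all spaces:
    # Assert that the flattened lists are identical
    assert [word for split in words1_split for word in split] == \
        [word for split in words2_split for word in split]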
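To see what this precondition accepts and rejects, here is a small standalone sketch. `flatten_words` is a hypothetical helper introduced only for illustration; it performs the same flattening as the two list comprehensions in the assertion.

```python
def flatten_words(groups):
    """Return the words of all groups as one flat list (hypothetical helper)."""
    return [word for group in groups for word in group.split(" ")]


sentences = ["This is a sentence", "so is this"]
phrases = ["This is", "a sentence so", "is this"]

# Same words in the same order, only grouped differently: the check passes.
same = flatten_words(sentences) == flatten_words(phrases)

# A phrase list with missing words would make the assertion fail.
mismatch = flatten_words(sentences) == flatten_words(["This is", "a sentence"])

print(same, mismatch)  # True False
```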
To keep track of the phrases we still have to look at, we use a `deque`, which is a queue from Python's `collections` library.
    # Create a queue and two tracking lists
    Q = deque(enumerate(words2_split))
    result = []
    splits = []
We initialize this queue with every phrase of the second word list, combined with its index in that word list. See `enumerate`.
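As a quick illustration of what the queue holds right after initialization, the following sketch builds it for the question's phrases and takes the first item off the front. The variable names are illustrative only.

```python
from collections import deque

# The phrases from the question, already split into words
phrases_split = [["This", "is"], ["a", "sentence", "so"], ["is", "this"]]

# enumerate() pairs every phrase with its index in the list
queue = deque(enumerate(phrases_split))

first = queue.popleft()                     # items leave from the front
remaining_ids = [index for index, _ in queue]

print(first)          # (0, ['This', 'is'])
print(remaining_ids)  # [1, 2]
```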
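Since we compare the phrases of the second word list against the sentences of the first word list, we somehow have to keep track of which parts of the first word list we have already looked at.
    # Keep track of the current sublist in words1
    words1_sublist_id = 0
    words1_sublist_offset = 0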
Since our queue is our "work heap", we execute the following code as long as there are items in the queue:
    # Keep iterating until the queue is empty
    while Q:
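First things first: get the item at the front of the queue. I am unpacking the tuple that we pushed into the queue during initialization, when the queue was created. `sublist_id` is the index of the sublist's position in the second word list, and `sublist` is the actual list of words, i.e. the phrase. We also compute the length of the phrase, which we will need later.
        sublist_id, sublist = Q.popleft()
        sublist_len = len(sublist)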
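Now we need to check whether the current phrase fits into the sentence we are currently looking at. (At the start of the algorithm, `words1_sublist_id` is 0, so we look at the first group of the first word list.)
        words1_sublist_len = len(words1_split[words1_sublist_id])
        words1_remaining_len = words1_sublist_len - words1_sublist_offset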
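What does "fit" mean here? If the phrase fits into the sentence, the phrase can be fully represented by that sentence.
IF: The phrase is no longer than the remainder of the sentence, i.e. we don't have to split!
        if sublist_len <= words1_remaining_len: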
Since we don't have to split, we can simply append the phrase to the `result` list (I `join` it with a space `" "` to turn the phrase back into a string).
            # The sublist fits entirely into the current segment in words 1,
            # add sublist untouched to resulting list.
            result.append(" ".join(sublist))
Since we just fitted the phrase into the sentence, we have to update the tracking to reflect the progress we made. While doing so, we have to be careful to respect sentence boundaries.
            # Update the sublist tracking
            if (words1_sublist_len - words1_sublist_offset - sublist_len) == 0:
                # The sublist filled the remaining space
                words1_sublist_id += 1
                words1_sublist_offset = 0
            else:
                # The sublist only filled part of the remaining space
                words1_sublist_offset += sublist_len
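The bookkeeping for the "fits" case can be isolated into a small function to make the two outcomes easier to see. This is only a sketch: `place_phrase` and its return convention are assumptions made for illustration and do not appear in the answer's code.

```python
def place_phrase(sentence_len, offset, phrase_len):
    """Return (sentence_id_step, new_offset) for a phrase that fits.

    sentence_len: number of words in the current sentence
    offset:       words of that sentence already consumed
    phrase_len:   number of words in the phrase being placed
    """
    remaining = sentence_len - offset
    if phrase_len > remaining:
        raise ValueError("phrase does not fit; it would have to be split")
    if phrase_len == remaining:
        return 1, 0                  # sentence is full: next sentence, offset 0
    return 0, offset + phrase_len    # stay in the sentence, advance the offset


# "This is" placed into "This is a sentence" (4 words, nothing consumed yet)
print(place_phrase(4, 0, 2))  # (0, 2)
# A 2-word phrase that exactly fills the remaining 2 words
print(place_phrase(4, 2, 2))  # (1, 0)
```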
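ELSE: The phrase is longer than the remainder of the sentence, i.e. the phrase cannot be represented by the sentence.
        else:
In this case, we have to split the phrase at the point where it overflows into the next sentence. We determine the "split point" from the number of words remaining in the sentence (e.g. if the phrase is 3 words long but the sentence has only 2 words left, we split the phrase after two words).
            # Only part of the current sublist fits.
            # Split the segment at the point where the left
            # part fits into the current segment of words1.
            # Then add the remaining right list to the front
            # of the queue.
            left = " ".join(sublist[:words1_remaining_len])
            right = sublist[words1_remaining_len:]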
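(Since the `left` part of the split is "done", I `join` it into a string. The `right` part is not done yet, so we still want it split into individual words.)
After splitting the phrase, we can push the `left` part onto our `result` list, because we now know that it fits into the current sentence. We know nothing about the `right` part yet: it may fit into the next sentence, or it may overflow that one as well (see the last example above).
Since we don't know what to do with the `right` part, we have to treat it as a new phrase: we simply add it to the front of our work queue, so that it is processed in the next iteration.
            result.append(left)
            Q.appendleft((sublist_id, right))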
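A standalone sketch of this overflow step, using the phrase "a sentence so" from the question. The starting state (two words left in the current sentence, one phrase still waiting in the queue) is assumed for the sake of the example.

```python
from collections import deque

queue = deque([(2, ["is", "this"])])             # work still waiting
sublist_id, sublist = 1, ["a", "sentence", "so"]  # item just taken off the queue
remaining = 2  # assumed: words left in the current sentence

left = " ".join(sublist[:remaining])   # finished part, back to a string
right = sublist[remaining:]            # unfinished part, still a word list

# The leftover goes to the FRONT so it is handled before later phrases.
queue.appendleft((sublist_id, right))

print(left)         # a sentence
print(list(queue))  # [(1, ['so']), (2, ['is', 'this'])]
```

Pushing to the front rather than the back keeps the pieces in reading order, which the index-based merge at the end relies on.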
Our `result` list does not record where we split, so we keep track of the split points separately.
            # Keep track of splits
            splits.append(sublist_id)
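Again, we have to keep track of our current position in the `words1` list. Because we know that we have overflowed the current sentence, we can simply increment the index and reset the offset.
            # update indices
            words1_sublist_id += 1
            words1_sublist_offset = 0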
Once the work queue is empty, we can build the sublists for the phrases we split. This one is a bit tricky:
    # Combine splits into sublists to get desired result
    for split in splits:
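If the item at the split point is a string, we can infer that we have not yet merged a split at this position. We can therefore replace the item at the split point and the item after it with a list containing the two of them. (We use `split+2` instead of `split+1` because the end of a slice is exclusive.)
        if isinstance(result[split], str):
            result[split:split+2] = [[result[split], result[split + 1]]]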
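The effect of the exclusive slice end is easiest to see by comparing both variants on the intermediate list from the first example. This is a small illustrative sketch; `wrong` exists only to show what `split+1` would do.

```python
result = ['This is', 'a sentence', 'so', 'is this']
split = 1

# Variant with split+1: the slice covers only ONE item, so 'so' is left behind.
wrong = list(result)
wrong[split:split + 1] = [[wrong[split], wrong[split + 1]]]
print(wrong)   # ['This is', ['a sentence', 'so'], 'so', 'is this']

# Variant with split+2: the slice covers both items, which are replaced together.
result[split:split + 2] = [[result[split], result[split + 1]]]
print(result)  # ['This is', ['a sentence', 'so'], 'is this']
```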
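However, if the item at the split point is a list, we know that we are at a position that was already split earlier (i.e. the phrase overflowed a sentence at least twice, see the last example above).
In that case, we append the item that follows the list, `result[split+1]`, to the list and then use `del` to delete the item we just appended.
        else:
            result[split] = result[split] + [result[split + 1]]
            del result[split + 1]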
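To follow both branches in one run, here is the merge loop applied to the intermediate state of the last example above, where the second phrase overflowed twice. The starting values of `result` and `splits` were worked out by hand from the algorithm and are given here as assumptions of the sketch.

```python
# State after the while-loop for:
#   words1 = ["This is a longer", "sentence with", "different splits"]
#   words2 = ["This is", "a longer sentence with different splits"]
result = ['This is', 'a longer', 'sentence with', 'different splits']
splits = [1, 1]   # phrase 1 was split twice

for split in splits:
    if isinstance(result[split], str):
        # First split at this position: wrap the two neighbours in a list.
        result[split:split + 2] = [[result[split], result[split + 1]]]
    else:
        # Later split at the same position: extend the existing list.
        result[split] = result[split] + [result[split + 1]]
        del result[split + 1]

print(result)  # ['This is', ['a longer', 'sentence with', 'different splits']]
```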
Once all of that is done, we can return the result!
    return result
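Answer 1 (score: 0)
I think this does what you want fairly directly. Based on the answer to my question about adjacency, I took the approach of assigning sentence ids, each equal to the index of the sentence in the `sentences` list. I then paired every word with the id of the sentence it came from.
With this information, I process the phrases, breaking them up into words, and compare each phrase word with the next sentence word. According to you, they should be identical: the words are not reordered, only split into different groups. If the phrase word and the sentence word differ, an assertion fails.
So, for each phrase, I start with the sentence id of the first word and collect words in a list. If the sentence id changes, we have moved from one sentence to the next: the collected words are joined into a sub-phrase and added to a temporary pool, and the collection starts over.
At the end of each phrase, I have a pool of sub-phrases. If there is only one sub-phrase (because the sentence id was the same throughout the phrase), I add the phrase to the result directly. If there are several sub-phrases, I add the pool to the result as a list.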
sentences = ["This is a sentence", "so is this"]
phrases = ["This is", "a sentence so", "is this"]
# Pair every (lowercased) word with the id of the sentence it came from
sent_words = [(id, w) for id, sent in enumerate(sentences) for w in sent.lower().split()]
sw = iter(sent_words)
new_phrases = []
for phrase in phrases:
    last_sid = None
    new_phrase = []
    words = []
    for p_id, p_word in enumerate(phrase.lower().split()):
        s_id, s_word = next(sw)
        assert p_word == s_word, "Text differs in sentence {}, phrase {}: '{}' vs. '{}'".format(s_id, p_id, s_word, p_word)
        #print("[{}] {} : [{}] {}".format(s_id, s_word, p_id, p_word))
        if last_sid is None:
            last_sid = s_id
            words.append(p_word)
        elif s_id != last_sid:
            # Sentence boundary crossed: close the current sub-phrase
            new_phrase.append(' '.join(words))
            words = [p_word]
            # Remember the new sentence id; otherwise every following word
            # of this phrase would be treated as another boundary
            last_sid = s_id
        else:
            words.append(p_word)
    else:
        # for-else: runs once the phrase's words are exhausted
        if words:
            new_phrase.append(' '.join(words))
    if len(new_phrase) == 1:
        new_phrases.extend(new_phrase)
    else:
        new_phrases.append(new_phrase)
print(new_phrases)
This prints:
['this is', ['a sentence', 'so'], 'is this']
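The same "group the words of a phrase by sentence id" idea can also be sketched with `itertools.groupby`. This is an alternative formulation, not the answer's code: `split_by_sentence` is a hypothetical name, the sketch does not lowercase the text, and it assumes without checking that both lists contain the same words in the same order.

```python
from itertools import groupby


def split_by_sentence(sentences, phrases):
    # One sentence id per word, in reading order
    word_ids = iter([sid for sid, sent in enumerate(sentences)
                     for _ in sent.split()])
    out = []
    for phrase in phrases:
        words = phrase.split()
        ids = [next(word_ids) for _ in words]   # sentence id of each phrase word
        # Consecutive words that share a sentence id form one sub-phrase
        parts = [" ".join(word for word, _ in group)
                 for _, group in groupby(zip(words, ids), key=lambda pair: pair[1])]
        out.append(parts[0] if len(parts) == 1 else parts)
    return out


sentences = ["This is a sentence", "so is this"]
phrases = ["This is", "a sentence so", "is this"]
print(split_by_sentence(sentences, phrases))
# ['This is', ['a sentence', 'so'], 'is this']
```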
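Answer 2 (score: 0)
I created a flattened `enumerate` of every word in `sentences` and `phrases`, then `zip`ped the indices together. Each unique `tuple` of indices describes one element of the result, and this also works for the other groupings mentioned, for example: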
def enumerate_x(d):
    return ((i, w) for i, f in enumerate(d) for w in f.split())
def align(sentences, phrases):
    r = {}
    for (i1, w1), (i2, w2) in zip(enumerate_x(sentences), enumerate_x(phrases)):
        assert w1 == w2, f'{w1} != {w2}'  # Py <3.6: '{} != {}'.format(w1, w2)
        r.setdefault((i1, i2), []).append(w1)
    return [' '.join(r[k]) for k in sorted(r)]
>>> sentences = ["This is a sentence", "so is this"]
>>> phrases = ["This is", "a sentence so", "is this"]
>>> align(sentences, phrases)
['This is', 'a sentence', 'so', 'is this']
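To make the role of the index tuples more concrete, the sketch below lists the (sentence index, phrase index) pair of every word for the example above. `word_index_pairs` is a hypothetical helper written for this illustration; it assumes both lists contain the same number of words.

```python
def word_index_pairs(sentences, phrases):
    """Return one (sentence_index, phrase_index) pair per word."""
    s_ids = [i for i, s in enumerate(sentences) for _ in s.split()]
    p_ids = [i for i, p in enumerate(phrases) for _ in p.split()]
    return list(zip(s_ids, p_ids))


sentences = ["This is a sentence", "so is this"]
phrases = ["This is", "a sentence so", "is this"]

pairs = word_index_pairs(sentences, phrases)
print(pairs)
# [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 2), (1, 2)]

unique_pairs = sorted(set(pairs))
print(unique_pairs)       # [(0, 0), (0, 1), (1, 1), (1, 2)]
print(len(unique_pairs))  # 4
```

Each of the four unique pairs corresponds to one element of the flat output, so the phrase "a sentence so" contributes two of them because it spans two sentences.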