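I have extracted some sentences from a text in Python. The text is stored in a string and the sentences are stored in a list. Here is some sample input: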
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
extracted = ['Here is sentence 2.', 'This is a text']
Now I want to sort the elements of the extracted list by the order in which they appear in the text. This is the output I want:
ordered_result = ['This is a text', 'Here is sentence 2.']
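Does anyone know how to do this? Thanks in advance.
Answer 0 (score: 1)
One way is to use a dictionary to build an index mapping, which has O(n) complexity.
You can then pass this dictionary's lookup to sorted as a custom key.
This method relies on having a list of sentences to begin with. I have built one below in case you don't have one.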
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
extracted = ['Here is sentence 2.', 'This is a text.']
# create list of sentences
full_list = [i.strip()+'.' for i in filter(None, text.split('.'))]
# map sentences to integer location
d_map = {v: k for k, v in enumerate(full_list)}
# sort by calculated location mapping
extracted_sorted = sorted(extracted, key=d_map.get)
print(extracted_sorted)
# ['This is a text.', 'Here is sentence 2.']
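One caveat with using `d_map.get` as the key: for a sentence that is not in `full_list` it returns `None`, and Python 3 cannot compare `None` with an integer, so `sorted` raises a `TypeError`. Here is a minimal sketch of one way around that, assuming it is acceptable to put unmatched items at the end. The `order_by_map` helper is hypothetical and not part of the answer above.

```python
def order_by_map(extracted, full_list):
    """Sort `extracted` by position in `full_list`; unknown items go last."""
    d_map = {v: k for k, v in enumerate(full_list)}
    missing = len(full_list)  # larger than every real index (assumption: unknowns sort last)
    return sorted(extracted, key=lambda s: d_map.get(s, missing))

text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
full_list = [s.strip() + '.' for s in filter(None, text.split('.'))]
result = order_by_map(
    ['Here is sentence 2.', 'Not in the text.', 'This is a text.'],
    full_list,
)
print(result)
# ['This is a text.', 'Here is sentence 2.', 'Not in the text.']
```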
Answer 1 (score: 1)
Just sort them by their position in the original string:
ordered_result = sorted(extracted, key=lambda x: text.index(x))
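Because `text.index` does a substring search, this also works for the question's original input, where 'This is a text' has no trailing period. It does raise a `ValueError` if an item is not found in `text`, and each lookup is a linear scan of the string. A small sketch using `str.find` instead, assuming unmatched items should simply sort last (the `position` helper is made up for illustration):

```python
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
extracted = ['Here is sentence 2.', 'This is a text', 'Missing sentence.']

def position(sentence):
    # str.find returns -1 when the substring is absent; map that past the end of text
    idx = text.find(sentence)
    return idx if idx != -1 else len(text)

ordered_result = sorted(extracted, key=position)
print(ordered_result)
# ['This is a text', 'Here is sentence 2.', 'Missing sentence.']
```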
Answer 2 (score: 0)
The preferred (but slightly more complex) way is to use a regular expression search:
import re
expression = re.compile(r'([A-Z][^\.!?]*[\.!?])')
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
# Find all occurrences of `expression` in `text`
match = expression.findall(text)
print(match)
# ['This is a text.', 'This is sentence 1.', 'Here is sentence 2.', 'And this is sentence 3.']
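The regex only splits the text into sentences; it does not yet reorder `extracted`. One possible way to finish the job is to combine the matches with the index-map idea from Answer 0. This is a sketch under the assumption that every item in `extracted` matches a regex result exactly, punctuation included; otherwise the lookup raises a `KeyError`. The names `sentences` and `order` are introduced here for illustration.

```python
import re

expression = re.compile(r'([A-Z][^\.!?]*[\.!?])')
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
extracted = ['Here is sentence 2.', 'This is a text.']

# Sentences in the order they appear in the text
sentences = expression.findall(text)
# Map each sentence to its position
order = {s: i for i, s in enumerate(sentences)}

ordered_result = sorted(extracted, key=lambda s: order[s])
print(ordered_result)
# ['This is a text.', 'Here is sentence 2.']
```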
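The simpler (but cruder) way to do this is to split the text on ". ", which gives you the list of sentences in chronological order. The only drawback is that you lose the punctuation.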
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
splitt = text.split(". ")
print(splitt)
# splitt = ['This is a text', 'This is sentence 1', 'Here is sentence 2', 'And this is sentence 3.']
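Since splitting drops the trailing period from every piece except the last, comparing the pieces against `extracted` needs some normalization. A rough sketch, assuming sentences end only in '.' and stripping it on both sides before the lookup (the `normalize` helper is hypothetical, and `list.index` raises a `ValueError` for items that are not found):

```python
text = "This is a text. This is sentence 1. Here is sentence 2. And this is sentence 3."
extracted = ['Here is sentence 2.', 'This is a text']

def normalize(sentence):
    # Drop surrounding whitespace and any trailing periods so both sides compare equal
    return sentence.strip().rstrip('.')

pieces = [normalize(p) for p in text.split(". ")]
ordered_result = sorted(extracted, key=lambda s: pieces.index(normalize(s)))
print(ordered_result)
# ['This is a text', 'Here is sentence 2.']
```

The original strings in `extracted` are returned unchanged; only the sort key is normalized.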