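I have a list of strings:
[u'This', u'the third largest earthquake in recorded history', u'the third largest earthquake', u'recorded history', u'massive tsunamis , which caused widespread devastation when they hit land , leaving an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'massive tsunamis', u'widespread devastation', u'they', u'land', u'an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'an estimated 230,000 people', u'countries around the Bay of Bengal and the Indian Ocean', u'countries', u'the Bay of Bengal and the Indian Ocean', u'the Bay', u'Bengal and the Indian Ocean', u'Bengal', u'the Indian Ocean']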
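As you can see, some elements contain other elements. For example:
u'the third largest earthquake in recorded history'
contains:
u'the third largest earthquake'
u'recorded history'
How can I keep only the finest-grained elements, such as u'recorded history', and discard the rest?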
Answer 0: (score: 4)
I believe this does what you want:
In [14]: allstrings = [u'This', u'the third largest earthquake in recorded history', u'the third largest earthquake', u'recorded history', u'massive tsunamis , which caused widespread devastation when they hit land , leaving an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'massive tsunamis', u'widespread devastation', u'they', u'land', u'an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'an estimated 230,000 people', u'countries around the Bay of Bengal and the Indian Ocean', u'countries', u'the Bay of Bengal and the Indian Ocean', u'the Bay', u'Bengal and the Indian Ocean', u'Bengal', u'the Indian Ocean']
In [15]: [s for s in allstrings if not any(t in s for t in allstrings if t != s)]
Out[15]:
[u'This',
u'the third largest earthquake',
u'recorded history',
u'massive tsunamis',
u'widespread devastation',
u'they',
u'land',
u'an estimated 230,000 people',
u'countries',
u'the Bay',
u'Bengal',
u'the Indian Ocean']
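The list comprehension starts simply enough. It selects the strings from the master list allstrings that meet some condition: [s for s in allstrings if ....]
The condition that a string s must satisfy to be in the final list is:
not any(t in s for t in allstrings if t != s)
As you can see, this tests whether any other string t in allstrings occurs inside s. If there is no such string t, then s is included in the final list.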
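For readers who find the nested comprehension dense, the same check can be written as explicit loops. This is only a sketch of the logic described above; the function name finest_grained and the three-element sample list are made up for illustration and are not part of the original answer.

```python
def finest_grained(strings):
    """Return the strings that contain no other distinct string from the list."""
    kept = []
    for s in strings:
        # s is dropped as soon as some different string t occurs inside it.
        contains_other = False
        for t in strings:
            if t != s and t in s:
                contains_other = True
                break
        if not contains_other:
            kept.append(s)
    return kept


# Hypothetical sample: a short excerpt of the list from the question.
sample = [u'the third largest earthquake in recorded history',
          u'the third largest earthquake',
          u'recorded history']
print(finest_grained(sample))
```

On this sample the long phrase is dropped and the two shorter phrases are kept. Note that because the test is t != s, exact duplicates never eliminate each other, so both copies of a repeated string would survive.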
Is the entity 'the' contained in the entity 'they'? The answer depends on what we mean by an entity. If we decide the answer is no, then the algorithm needs a minor change. The simplest approach seems to be to pad each string with spaces. As an example:
In [25]: u'the' in u'they'
Out[25]: True
In [26]: u' the ' in u' they '
Out[26]: False
To implement this, we add a step that pads each string with spaces, runs the containment check, and then strips the extra spaces:
In [30]: allstrings = [u'This', u'the third largest earthquake in recorded history', u'the third largest earthquake', u'recorded history', u'massive tsunamis , which caused widespread devastation when they hit land , leaving an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'massive tsunamis', u'widespread devastation', u'they', u'land', u'an estimated 230,000 people dead in countries around the Bay of Bengal and the Indian Ocean', u'an estimated 230,000 people', u'countries around the Bay of Bengal and the Indian Ocean', u'countries', u'the Bay of Bengal and the Indian Ocean', u'the Bay', u'Bengal and the Indian Ocean', u'Bengal', u'the Indian Ocean']
In [31]: allstr2 = [u' {} '.format(s.strip()) for s in allstrings]
In [32]: [s.strip() for s in allstr2 if not any(t in s for t in allstr2 if t != s)]
Out[32]:
[u'This',
u'the third largest earthquake',
u'recorded history',
u'massive tsunamis',
u'widespread devastation',
u'they',
u'land',
u'an estimated 230,000 people',
u'countries',
u'the Bay',
u'Bengal',
u'the Indian Ocean']
As you can see, this refinement makes no difference for the given strings, but it may for others.
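To see a case where the padding does change the result, here is a small sketch that wraps both variants in one function. The name select_finest, the whole_words flag, and the three-word list are assumptions made up for this illustration. Padding only treats spaces as word boundaries, so a string that touches punctuation directly would still need different handling.

```python
def select_finest(strings, whole_words=False):
    """Keep the strings that contain no other distinct string from the list.

    With whole_words=True, each string is padded with one space on both
    sides first, so a match has to begin and end at a space.
    """
    if whole_words:
        candidates = [u' {} '.format(s.strip()) for s in strings]
    else:
        candidates = list(strings)
    kept = [s for s in candidates
            if not any(t in s for t in candidates if t != s)]
    # Remove the padding again before returning.
    return [s.strip() for s in kept] if whole_words else kept


# Hypothetical input in which 'the' is a raw substring of 'they'.
words = [u'the', u'they', u'they hit land']
plain = select_finest(words)
padded = select_finest(words, whole_words=True)
print(plain)
print(padded)
```

Without padding only 'the' survives, because 'they' contains it as a raw substring. With padding both 'the' and 'they' are kept, and only 'they hit land' is discarded.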