I'm looking for a Pythonic way to split a sentence into words while also storing the index of each word within the sentence, e.g.
a = "This is a sentence"
b = a.split() # ["This", "is", "a", "sentence"]
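Now I also want to store the index information for every word:
c = a.splitWithIndices() #[(0,3), (5,6), (8,8), (10,17)]
What is the best way to implement splitWithIndices()? Does Python have a library method I can use for this? Any approach that helps me compute the word indices would be great.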
Answer 0 (score: 19)
Here is an approach using regular expressions:
>>> import re
>>> a = "This is a sentence"
>>> matches = [(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)]
>>> matches
[('This', (0, 3)), ('is', (5, 6)), ('a', (8, 8)), ('sentence', (10, 17))]
>>> b, c = zip(*matches)
>>> b
('This', 'is', 'a', 'sentence')
>>> c
((0, 3), (5, 6), (8, 8), (10, 17))
As a one-liner:
b, c = zip(*[(m.group(0), (m.start(), m.end()-1)) for m in re.finditer(r'\S+', a)])
If you only want the indices:
c = [(m.start(), m.end()-1) for m in re.finditer(r'\S+', a)]
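The pieces above can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name `split_with_indices` and the `inclusive` flag are hypothetical, introduced here for illustration.

```python
import re

def split_with_indices(text, inclusive=True):
    """Return (word, (start, end)) pairs for each whitespace-delimited word.

    Hypothetical helper: with inclusive=True the end index points at the
    last character of the word; otherwise it is the usual half-open end.
    """
    offset = 1 if inclusive else 0
    return [(m.group(0), (m.start(), m.end() - offset))
            for m in re.finditer(r'\S+', text)]

a = "This is a sentence"
pairs = split_with_indices(a)
words = [word for word, _ in pairs]   # the same words a.split() gives
spans = [span for _, span in pairs]   # inclusive (start, end) per word
```

Building the two lists separately also sidesteps the `ValueError` that `b, c = zip(*matches)` raises when the string contains no words at all.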
Answer 1 (score: 9)
I think it is more natural to return the start and end of the corresponding slice, for example (0, 4) instead of (0, 3).
>>> from itertools import groupby
>>> def splitWithIndices(s, c=' '):
...     p = 0
...     for k, g in groupby(s, lambda x: x == c):
...         q = p + sum(1 for i in g)
...         if not k:
...             yield p, q # or p, q-1 if you are really sure you want that
...         p = q
...
>>> a = "This is a sentence"
>>> list(splitWithIndices(a))
[(0, 4), (5, 7), (8, 9), (10, 18)]
>>> a[0:4]
'This'
>>> a[5:7]
'is'
>>> a[8:9]
'a'
>>> a[10:18]
'sentence'
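The function above splits on a single separator character `c`. If the input may contain tabs, newlines, or runs of spaces, one possible variation is to group on `str.isspace` instead. This is a sketch under that assumption; the name `split_spans` is hypothetical and not from the original answer.

```python
from itertools import groupby

def split_spans(s):
    """Yield half-open (start, end) spans of the non-whitespace runs in s."""
    pos = 0
    for is_space, group in groupby(s, key=str.isspace):
        end = pos + sum(1 for _ in group)  # length of this run
        if not is_space:
            yield pos, end
        pos = end

text = "This  is\ta sentence\n"
spans = list(split_spans(text))
words = [text[i:j] for i, j in spans]  # slicing recovers each word
```

Because the spans are half-open, each one can be passed straight to a slice, and for this input the recovered words match `text.split()`.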