Given the index of a word in a text, I need to get its character index. For example, in the text below:
"The cat called other cats."
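The index of the word "cat" is 1. I need the index of the first character of cat, i.e. c, which would be 4. I don't know if it's relevant, but I'm using python-nltk to get the words. Right now, the only way I can think of to do this is: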
- Get the first character, find the number of words in this piece of text
- Get the first two characters, find the number of words in this piece of text
- Get the first three characters, find the number of words in this piece of text
Repeat until we get to the required word.
But this would be very inefficient. Any ideas would be greatly appreciated.
Answer 0 (score: 1)
You can use a dict here:
>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)
>>> char_index('c',4)
21
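If only the start of the word is needed (which is what the question asks for), the dict can be skipped. Here is a minimal sketch of that idea. The name `word_start` is made up for illustration, and `\w+` is only an assumption about what counts as a word; NLTK may tokenize differently.

```python
import re
from itertools import islice

def word_start(text, word_index):
    """Return the offset of the first character of the word at word_index, or None."""
    # finditer is lazy, so scanning stops as soon as the requested word is reached.
    matches = re.finditer(r"\w+", text)
    match = next(islice(matches, word_index, None), None)
    return match.start() if match else None

print(word_start("The cat called other cats.", 1))  # 4
```

An out-of-range index gives None here instead of raising, which mirrors how `char_index('c',3)` above quietly returns nothing.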
Answer 1 (score: 0)
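import re
def char_index(sentence, word_index):
    # Parentheses in the pattern keep the split characters in the result,
    # so words sit at even positions and separators at odd positions.
    sentence = re.split(r'(\s)', sentence)
    return len(''.join(sentence[:word_index*2]))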
>>> s = 'The die has been cast'
>>> char_index(s,3) #'been' has index 3 in the list of words
12
>>> s[12]
'b'
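One caveat: the `word_index*2` arithmetic seems to rely on exactly one whitespace character between words and none at the start of the sentence. A possible variant that tolerates runs of whitespace is sketched below. `char_index_ws` is a hypothetical name, and `\S+` is chosen on the assumption that words are the same tokens `str.split()` would give.

```python
import re

def char_index_ws(sentence, word_index):
    """Return the start offset of the whitespace-delimited word at word_index."""
    # \S+ matches each maximal run of non-whitespace, i.e. what str.split() yields.
    starts = [m.start() for m in re.finditer(r"\S+", sentence)]
    return starts[word_index]  # raises IndexError if word_index is out of range

print(char_index_ws("The die has been cast", 3))      # 12
print(char_index_ws("The  die   has been cast", 3))   # 15
```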
Answer 2 (score: 0)
Use enumerate():
>>> def obt(phrase, indx):
... word = phrase.split()[indx]
... e = list(enumerate(phrase))
... for i, j in e:
... if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word:
... return i
...
>>> obt("The cat called other cats.", 1)
4
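Note that `obt` returns the first place the word's text occurs, so for a repeated word (or a word that also appears inside an earlier one) it may report the wrong position. If a token list is already available, for example from python-nltk, one option is to walk the text once with `str.find` and a moving cursor. This is only a sketch: `token_offsets` is a hypothetical helper, the token list is typed in by hand, and it assumes every token appears verbatim in the text (some NLTK tokenizers rewrite quote characters, which would break that assumption).

```python
def token_offsets(text, tokens):
    """Return the start offset in text of each token, searching left to right."""
    offsets = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        offsets.append(start)
        cursor = start + len(token)
    return offsets

text = "The cat called other cats."
tokens = ["The", "cat", "called", "other", "cats", "."]  # hand-written token list
print(token_offsets(text, tokens))  # [0, 4, 8, 15, 21, 25]
```

The character index for word index `i` is then `token_offsets(text, tokens)[i]`, and the whole text is scanned only once.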