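How can I find the positions of a list of substrings in a string?
Given a string:
"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
And a list of substrings:
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
Desired output: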
>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
(34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
(68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
(110, 119), (120, 122), (123, 131), (131, 132)]
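To explain the output: the first substring "The" can be found in s using the character indices (start, end), which is why the desired output starts with (0, 3).
So if we iterate over all the integer tuples in the desired output, we get back the list of substrings, i.e.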
>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
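I've tried:
def find_offsets(tokens, s):
    index = 0  # position in s up to which tokens have been matched
    offsets = []
    for token in tokens:
        # Search only the unscanned tail, then convert back to an index into s.
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets
Is there another way to find the positions of a list of substrings in a string?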
Answer 0 (score: 4)
First solution:
# Use a list comprehension and str.index.
[(s.index(e), s.index(e) + len(e)) for e in tokens]
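To make the limitation of this first approach concrete, here is a small check using the sample string and token list from the test further down in this answer. `str.index` always returns the first occurrence, so a repeated token gets the same offsets every time:

```python
# str.index always searches from the start of the string, so every
# repeated token is mapped to its first occurrence.
s = 'The plane, plane'
tokens = ['The', 'plane', ',', 'plane']

offsets = [(s.index(e), s.index(e) + len(e)) for e in tokens]
print(offsets)  # [(0, 3), (4, 9), (9, 10), (4, 9)]
```

The second 'plane' should have been (11, 16), not (4, 9).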
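Second solution, which fixes a problem with the first one (repeated tokens always map to their first occurrence):
def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0  # current position in s
    for id_token, token in enumerate(tid):
        # Advance until the first character of the token matches.
        # Note: only the first character is compared, not the whole token.
        while token[0] != s[i]:
            i += 1
        tid[id_token] = (i, i + len(token))
        i += len(token)
    return tid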
find_offsets(tokens, s)
Out[201]:
[(0, 3),
(4, 9),
(9, 10),
(11, 16),
(17, 20),
(21, 23),
(24, 34),
(34, 35),
(36, 43),
(44, 46),
(47, 52),
(52, 54),
(55, 60),
(61, 67),
(68, 72),
(73, 75),
(76, 83),
(84, 89),
(90, 98),
(99, 103),
(104, 109),
(110, 119),
(120, 122),
(123, 131),
(131, 132)]
#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
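One caveat worth noting: the loop above only compares the first character of each token against the text, so it relies on no earlier stray character matching that first letter. Below is a minimal sketch of a stricter variant that uses `str.startswith` with a start position to compare the whole token. The name `find_offsets_strict` is hypothetical and not part of the original answer:

```python
def find_offsets_strict(tokens, s):
    """Like find_offsets, but compares the whole token, not just its first character."""
    offsets = []
    i = 0  # current position in s
    for token in tokens:
        # Advance until the full token matches at position i.
        while i < len(s) and not s.startswith(token, i):
            i += 1
        if i >= len(s):
            raise ValueError(f"token {token!r} not found")
        offsets.append((i, i + len(token)))
        i += len(token)
    return offsets

print(find_offsets_strict(['The', 'plane', ',', 'plane'], 'The plane, plane'))
# [(0, 3), (4, 9), (9, 10), (11, 16)]
```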
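Answer 1 (score: 1)
If we know nothing about the substrings, there is no way around rescanning the entire text for each substring.
If, as the data suggests, we know that these are sequential fragments of the text, given in text order, then it is easy to scan only the rest of the text after each match. There is no point in cutting the text each time, though.
def spans(text, fragments):
    result = []
    point = 0  # Current position in the text.
    for fragment in fragments:
        found_start = text.index(fragment, point)
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result
Test: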
>>> spans('foo in bar', ['foo', 'in', 'bar'])
[(0, 3), (4, 6), (7, 10)]
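This assumes that every fragment is present in the text at the right position. Your output format does not give an example of how a mismatch should be reported. Using .find instead of .index could help with that, though only partly.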
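As a rough sketch of that idea, the variant below uses `str.find`, which returns -1 instead of raising `ValueError`, and records `None` for a fragment that is not found at or after the current position. The name `spans_or_none` and the choice of `None` as the mismatch marker are assumptions for illustration, not part of the original answer:

```python
def spans_or_none(text, fragments):
    """Return (start, end) per fragment, or None where a fragment is not found."""
    result = []
    point = 0  # Current position in the text.
    for fragment in fragments:
        found_start = text.find(fragment, point)
        if found_start == -1:
            # Not found at or after `point`: report it and keep scanning from the same place.
            result.append(None)
            continue
        found_end = found_start + len(fragment)
        result.append((found_start, found_end))
        point = found_end
    return result

print(spans_or_none('foo in bar', ['foo', 'baz', 'bar', 'in']))
# [(0, 3), None, (7, 10), None]
```

The last entry shows why this only partly helps: 'in' is present in the text, but because it comes out of order it is reported the same way as a fragment that is missing altogether.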
Answer 2 (score: 1)
import re
s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
for token in tokens:
    pattern = re.compile(re.escape(token))
    print(pattern.search(s).span())
RESULT
(0, 3)
(4, 9)
(9, 10)
(11, 16)
(17, 20)
(21, 23)
(24, 34)
(9, 10)
(36, 43)
(44, 46)
(47, 52)
(52, 54)
(55, 60)
(61, 67)
(68, 72)
(73, 75)
(76, 83)
(84, 89)
(90, 98)
(99, 103)
(104, 109)
(110, 119)
(120, 122)
(123, 131)
(131, 132)
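Note that the second ',' above is reported as (9, 10) rather than (34, 35), because each search starts again from the beginning of the string. A minimal sketch of one way around that, assuming the tokens appear in text order, is to pass the end of the previous match as the `pos` argument of the compiled pattern's `search` method. The function name `regex_offsets` is hypothetical:

```python
import re

def regex_offsets(tokens, s):
    """Find each token in order, starting each search where the previous match ended."""
    offsets = []
    pos = 0
    for token in tokens:
        pattern = re.compile(re.escape(token))
        match = pattern.search(s, pos)  # pos: index at which the search starts
        if match is None:
            raise ValueError(f"token {token!r} not found after index {pos}")
        offsets.append(match.span())
        pos = match.end()
    return offsets

print(regex_offsets(['The', 'plane', ',', 'plane'], 'The plane, plane'))
# [(0, 3), (4, 9), (9, 10), (11, 16)]
```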