Extract all substrings between two markers

Time: 2020-06-12 10:41:15

Tags: python python-3.x python-2.7 re

I have a string:

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

What I want is a list of the substrings between the markers start="&marker1" and end="/\n". So the expected result is:

whatIwant = ["The String that I want", "Another string that I want"]

I have read the answers here:

  1. Find string between two substrings [duplicate]
  2. How to extract the substring between two markers?

and tried the following, without success:

>>> import re
>>> mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> whatIwant = re.search("&marker1(.*)/\n", mystr)
>>> whatIwant.group(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

How can I fix this? Also, my string is very long:

>>> len(myactualstring)
7792818

2 answers:

Answer 0 (score: 1)

Consider this option using re.findall:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
matches = re.findall(r'&marker1\n(.*?)\s*/\n', mystr)
print(matches)

This prints:

['The String that I want', 'Another string that I want']

Here is an explanation of the regex pattern:

&marker1   match a marker
\n         newline
(.*?)      match AND capture all content until reaching the first
\s*        optional whitespace, followed by
/\n        / and newline

Note that re.findall returns only what appears in the (...) capture group, which is exactly the content you want to extract.
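
Since the question mentions a very long input (7,792,818 characters), it may be worth iterating over the matches lazily rather than building the whole list at once. The following is only a minimal sketch under that assumption: it reuses the pattern from this answer, and the names PATTERN and iter_between are hypothetical, chosen for illustration.

```python
import re

# Same pattern as in the answer, compiled once (PATTERN is an illustrative name).
PATTERN = re.compile(r'&marker1\n(.*?)\s*/\n')

def iter_between(text):
    """Yield each captured substring lazily, one match at a time."""
    for match in PATTERN.finditer(text):
        yield match.group(1)

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
results = list(iter_between(mystr))
print(results)
```

Calling list() here is only for display; with a large string you could loop over iter_between(text) directly and process each substring as it is found.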

Answer 1 (score: 1)

"How can I fix this?" I would do:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)
print(found)

Output:

['The String that I want ', 'Another string that I want ']

Please note that:

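  • & has no special meaning in Python re patterns, so the \& escape is optional here; escaping it is harmless and still matches a literal &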
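  • . matches anything except a newline
  • findall is a better choice than search if you just want a list of the matched substrings
  • *? is non-greedy; in this case .* would also work, because . does not match a newline, but in other cases a greedy match may end later than you expect
  • I used a so-called raw string (r prefix) to simplify escaping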

Read the re module documentation for a discussion of raw-string usage and the list of characters with special meaning.
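
The notes above can be checked with a short sketch. It assumes the sample string from the question, and the variable names are illustrative only. It shows why the original re.search call returned None, and how greedy and non-greedy quantifiers start to differ once re.DOTALL lets . match newlines.

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# Original attempt: "." cannot match the newline right after "&marker1",
# so "(.*)" matches nothing and "/\n" cannot follow; search returns None.
failed = re.search("&marker1(.*)/\n", mystr)
print(failed)

# Without DOTALL, greedy and non-greedy give the same result here,
# because "." stops at the end of each line.
greedy = re.findall(r"&marker1\n(.*)/\n", mystr)
lazy = re.findall(r"&marker1\n(.*?)/\n", mystr)
print(greedy == lazy)

# With DOTALL, "." also matches newlines, so the greedy version runs
# through to the last "/\n" and returns a single oversized match.
greedy_dotall = re.findall(r"&marker1\n(.*)/\n", mystr, flags=re.DOTALL)
lazy_dotall = re.findall(r"&marker1\n(.*?)/\n", mystr, flags=re.DOTALL)
print(len(greedy_dotall), len(lazy_dotall))
```

If the wanted text can itself span several lines, the non-greedy form with re.DOTALL is the safer of the two.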