Python regex question - [\S\s]* vs \d*

Posted: 2017-11-21 00:56:44

Tags: python regex

I am trying to find and replace within a string in Python 2.7. Here is my string (shown raw):

\n\n\nTOSS UP\n\n\n\n1. MATH Short Answer Pablo walks 4 miles north, 6 miles east, and then 2 miles north again. In simplest form, how many miles is he from his starting point?\n\n\n\nANSWER: 6\n\n\n\nBONUS\n\n\n\n1. MATH Short Answer Evaluate the limit as x approaches infinity of x times the quantity negative 1 plus e to the 1 over x.\n\n\n\nANSWER: 1\n\n\n\nTOSS UP\n\n\n\n2. CHEMISTRY Multiple Choice Which of the following is NOT a characteristic of amines?\n\n\n\nW) A fully protonated amine is called an ammonium ion\n\nX) Amines can function as Br\xc3\xb8nsted bases\n\nY) The VSEPR geometry of the nitrogen atom is trigonal planar\n\nZ) Amines can be a hydrogen bond acceptor\n\n\n\nANSWER: Y) The VSEPR geometry of the nitrogen atom is trigonal planar\n\n\n\nBONUS\n\n\n\n2. CHEMISTRY Multiple Choice Of the following elements in their monatomic gaseous states, which has the lowest electron affinity?\n\n\n\nW) BoronX) CarbonY) NitrogenZ) OxygenANSWER: Y) NITROGEN\n\n\n

I am searching with this regular expression and then doing some replacements:

searchString = (
    r"(TOSS\-UP|TOSSUP|TOSS\s*UP)\s*"
    r"(?P<questionNum>\d{1,2})[\.\)]\s*(?P<category>[A-Z ]+)\s*"
    r"(?i)(Short Answer|Multiple Choice)\s*(?P<tossupQ>[\S\s]*)"
    r"ANSWER\:\s*(?P<tossupA>[\S\s]*)"

    r"\s*BONUS\s*"
    r"(?P<questionNumBonus>\d{1,2})[\.\)]\s*(?P<categoryBonus>[A-Z ]+)\s*"
    r"(?i)(Short Answer|Multiple Choice)\s*(?P<bonusQ>[\S\s]*)"
    r"ANSWER\:(?P<bonusA>[\S\s]*)"
)

The result I get is:

{
    "category": 4,
    "questionNum": 1,
    "tossupQ": "Pablo walks 4 miles north, 6 miles east, and then 2 miles north again. In simplest form, how many miles is he from his starting point?\n\n\n\nANSWER: 6\n\n\n\nBONUS\n\n\n\n1. MATH  Short Answer  Evaluate the limit as x approaches infinity of x times the quantity negative 1 plus e to the 1 over x.\n\n\n\nANSWER: 1\n\n\n\nTOSS UP\n\n\n\n2. CHEMISTRY  Multiple Choice  Which of the following is NOT a characteristic of amines?\n\n\n\nW) A fully protonated amine is called an ammonium ion\n\nX) Amines can function as Br\xc3\xb8nsted bases\n\nY) The VSEPR geometry of the nitrogen atom is trigonal planar\n\nZ) Amines can be a hydrogen bond acceptor",
    "tossupA": "Y) The VSEPR geometry of the nitrogen atom is trigonal planar",
    "bonusQ": "Of the following elements in their monatomic gaseous states, which has the lowest electron affinity?\n\n\n\nW) BoronX) CarbonY) NitrogenZ) Oxygen",
    "bonusA": "Y) NITROGEN"
},

However, when I change the line r"ANSWER\:\s*(?P<tossupA>[\S\s]*)" to r"ANSWER\:\s*(?P<tossupA>[\d]*)", I get this:

{
    "category": 4,
    "questionNum": 1,
    "tossupQ": "Pablo walks 4 miles north, 6 miles east, and then 2 miles north again. In simplest form, how many miles is he from his starting point?",
    "tossupA": "6",
    "bonusQ": "Evaluate the limit as x approaches infinity of x times the quantity negative 1 plus e to the 1 over x.\n\n\n\nANSWER: 1\n\n\n\nTOSS UP\n\n\n\n2. CHEMISTRY  Multiple Choice  Which of the following is NOT a characteristic of amines?\n\n\n\nW) A fully protonated amine is called an ammonium ion\n\nX) Amines can function as Br\xc3\xb8nsted bases\n\nY) The VSEPR geometry of the nitrogen atom is trigonal planar\n\nZ) Amines can be a hydrogen bond acceptor\n\n\n\nANSWER: Y) The VSEPR geometry of the nitrogen atom is trigonal planar\n\n\n\nBONUS\n\n\n\n2. CHEMISTRY  Multiple Choice  Of the following elements in their monatomic gaseous states, which has the lowest electron affinity?\n\n\n\nW) BoronX) CarbonY) NitrogenZ) Oxygen",
    "bonusA": "Y) NITROGEN"
},

Why does tossupA not match the first answer with [\S\s]*, but only with \d*? Any help would be appreciated!

1 Answer:

Answer 0 (score: 1)

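The reason is that you are using greedy quantifiers. If you don't require ANSWER: to be followed by digits, tossupQ is allowed to match a longer string. So tossupQ swallows every question and answer up to the last ANSWER: that is still followed by a BONUS section, because the rest of the pattern needs that section in order to match.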

When you require ANSWER: to be followed by digits, tossupA can only match the first answer, and tossupQ has to stop early to allow that match.
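To make the backtracking concrete, here is a minimal sketch on a toy string. The input and the shortened patterns are hypothetical stand-ins for the real data and the full searchString; only the group shapes ([\S\s]* vs \d*) are taken from the question.

```python
import re

# Hypothetical toy input: three question/answer pairs separated by BONUS.
text = "Q one ANSWER: 6 BONUS Q two ANSWER: Y) planar BONUS Q three ANSWER: 1"

# Both groups greedy: the question group keeps as much as it can and gives
# back characters only until the rest of the pattern (ANSWER: ... BONUS) fits.
greedy = re.search(r"Q (?P<q>[\S\s]*)ANSWER:\s*(?P<a>[\S\s]*)\s*BONUS", text)

# Answer restricted to digits: only "ANSWER: 6 BONUS" fits that shape, so the
# question group is forced to stop before the first ANSWER:.
digits = re.search(r"Q (?P<q>[\S\s]*)ANSWER:\s*(?P<a>\d*)\s*BONUS", text)

print(repr(greedy.group("q")))
print(repr(digits.group("q")), repr(digits.group("a")))
```

In the greedy case the question group runs through the first answer and the first BONUS; in the digit case it stops at the first question, which mirrors the two results shown in the question.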

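You can fix this by switching to non-greedy quantifiers: *?. These match the shortest string that is consistent with the rest of the pattern, rather than the longest: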

searchString = (
    r"(TOSS\-UP|TOSSUP|TOSS\s*UP)\s*"
    r"(?P<questionNum>\d{1,2})[\.\)]\s*(?P<category>[A-Z ]+)\s*"
    r"(?i)(Short Answer|Multiple Choice)\s*(?P<tossupQ>[\S\s]*?)"
    r"ANSWER\:\s*(?P<tossupA>[\S\s]*?)"

    r"\s*BONUS\s*"
    r"(?P<questionNumBonus>\d{1,2})[\.\)]\s*(?P<categoryBonus>[A-Z ]+)\s*"
    r"(?i)(Short Answer|Multiple Choice)\s*(?P<bonusQ>[\S\s]*?)"
    r"ANSWER\:(?P<bonusA>[\S\s]*)"
)
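As a rough illustration of what the non-greedy version does, the sketch below runs a shortened, hypothetical pattern on a toy string (neither is the real data). It also shows, as far as I can tell, why the last group (bonusA) keeps the greedy *: a lazy group at the very end of a pattern has nothing after it to satisfy, so it matches the empty string.

```python
import re

# Hypothetical toy input: three question/answer pairs separated by BONUS.
text = "Q one ANSWER: 6 BONUS Q two ANSWER: Y) planar BONUS Q three ANSWER: 1"

# Lazy groups grow one character at a time and stop at the first point where
# the rest of the pattern matches, so each group captures a single item.
lazy = re.search(r"Q (?P<q>[\S\s]*?)ANSWER:\s*(?P<a>[\S\s]*?)\s*BONUS", text)
print(repr(lazy.group("q")), repr(lazy.group("a")))

# A lazy group as the final element of the pattern is satisfied immediately
# with zero characters.
trailing = re.search(r"ANSWER:\s*(?P<a>[\S\s]*?)", text)
print(repr(trailing.group("a")))
```

Note that a greedy final group will run to the end of the input, so if the string holds several toss-up/bonus pairs, the last group may need its own terminator.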

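BTW, [\S\s] is the same as . when the re.DOTALL flag is set. If you want . to match across multiple lines, use re.DOTALL so that it also matches newlines.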
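A small check of that equivalence, using a made-up two-line string:

```python
import re

# Hypothetical two-line input.
text = "line one\nline two"

# [\S\s] matches any character, including newlines, with no flags needed.
m_class = re.match(r"[\S\s]*", text)

# By default . stops at a newline.
m_dot = re.match(r".*", text)

# With re.DOTALL, . also matches newlines and behaves like [\S\s].
m_dotall = re.match(r".*", text, re.DOTALL)

print(repr(m_class.group()))
print(repr(m_dot.group()))
print(repr(m_dotall.group()))
```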