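Check whether a position in a string is within a pair of certain characters

Time: 2018-10-03 12:43:13

Tags: python latex-environment

In Python, what is the most efficient way to find out whether a position in a string lies within a pair of certain character sequences?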

       0--------------16-------------------37---------48--------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

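In the string cost, I want to determine whether a given position is enclosed by a pair of $ characters or lies within \(...\).

For the string cost, a function is_maths(cost, x) would return True for x in [37, 38, 39, 48] and False everywhere else.

The motivation is to find the valid LaTeX maths positions; any other efficient approach in Python is also welcome.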

1 answer:

Answer 0: (score: 2)

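You need to parse the string up to the requested position and, if that position is inside a valid pair of LaTeX environment delimiters, up to the closing delimiter as well, before you can answer with True or False. That is because you have to process each relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.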

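As I understand it, LaTeX's $...$ and \(...\) environment delimiters cannot be nested, so you don't have to worry about nested statements here; you only need to find the nearest complete $...$ or \(...\) pair.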

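However, you can't just match the literal $, \( or \) characters, because each of them could be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, iterate over the tokens in order, and track what you last matched to determine each token's effect (escaping the next character, or opening and closing a maths environment).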
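The tokenizing step can be looked at in isolation. This is a minimal sketch, assuming the same character class as the parser in this answer and reusing the sample string from the question; the names tokens and found are illustrative only:

```python
import re

# Match the only characters that can affect maths delimiters:
# a backslash, a dollar sign, or an opening/closing parenthesis.
tokens = re.compile(r'[\\$()]')

cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'

# Collect (character, index) pairs in the order they appear.
found = [(m.group(0), m.start()) for m in tokens.finditer(cost)]

for char, index in found:
    print(index, repr(char))
```

For this string only eleven characters are reported; everything else is skipped by the regular expression, so the parser only has to decide what each of those eleven tokens means.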

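There is no need to continue parsing once you are past the requested position and outside a maths environment; you already have your answer at that point and can return False early.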

Here is my implementation of such a parser:

import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')

def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos

def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
        elif (token, escaped) == expected_closer:
            if opener_pos < pos < token_pos:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

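        # a backslash escapes only the token directly after it, so clear
        # the flag once any other token has been processed
        if token != '\\':
            escaped = False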

    return False

Demo:

>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0)  # should be False
False
>>> is_maths(cost, 16)  # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37)  # should be True, within $...$
True
>>> is_maths(cost, 48)  # should be True, within \(...\)
True
>>> is_maths(cost, 57)  # should be False, within unescaped (...)
False

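Additional tests to show that escapes are handled correctly:

>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27)  # should be true
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27)  # no longer escaped, so false
False

My implementation studiously ignores malformed LaTeX: unescaped $ characters inside \(...\), escaped \( and \) characters inside $...$, further \( opening sequences inside a \(...\) sequence, and \) closing sequences without a matching \( opener preceding them. This ensures the function keeps working even when given input that LaTeX itself would not render. The parser could be altered to raise an exception or return False in those cases, however. In that case, add a global set created from _maths_pairs.keys() | _maths_pairs.values(), test (token, escaped) against that set when expected_closer is not None and (token, escaped) != expected_closer (to detect nested environment delimiters), and test for token == ')' and escaped and expected_closer is None to detect a \) closer without an opener.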
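One way the stricter behaviour might look is sketched below. This is not part of the original implementation: the name is_maths_strict, the _delimiters set and the choice of ValueError are assumptions made for illustration, and the tokenizing loop is inlined so that the block runs on its own.

```python
import re

_maths_pairs = {
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
# Hypothetical helper set: every (char, escaped) combination that opens
# or closes a maths environment.
_delimiters = _maths_pairs.keys() | _maths_pairs.values()
_tokens = re.compile(r'[\\$()]')


def is_maths_strict(s, pos):
    """Like is_maths(), but raise ValueError on malformed delimiters."""
    expected_closer = None  # (char, escaped) while inside an environment
    opener_pos = None
    escaped = False
    prev_pos = -1

    for match in _tokens.finditer(s):
        token, token_pos = match.group(0), match.start()
        if expected_closer is None and token_pos > pos:
            return False
        # a backslash only escapes the character directly after it
        if escaped and token_pos > prev_pos + 1:
            escaped = False
        prev_pos = token_pos

        if token == '\\':
            escaped = not escaped
            continue

        key = (token, escaped)
        escaped = False  # the escape, if any, has been consumed
        if key == expected_closer:
            if opener_pos < pos < token_pos:
                return True
            expected_closer = None
        elif expected_closer is not None:
            if key in _delimiters:
                raise ValueError(f'nested delimiter at position {token_pos}')
        elif key in _maths_pairs:
            expected_closer = _maths_pairs[key]
            opener_pos = token_pos
        elif key in _delimiters:
            # only an escaped ")" can reach this branch
            raise ValueError(f'closer without opener at position {token_pos}')

    return False
```

Because parsing still stops as soon as the answer is known, this sketch only reports malformed delimiters that occur before that point; validating the whole string would mean dropping the early returns.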