What is the most efficient way in Python to find out whether a position in a string lies within a pair of particular character sequences?
       0---------------16-------------------37---------48-------57
       |               |                    |          |        |
cost=r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
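In the string cost, I want to determine whether a given position is enclosed by a pair of $ characters or lies between \( and \).
For the string cost, a function is_maths(cost, x) should return True for x in [37, 38, 39, 48] and False everywhere else.
The motivation is to find the positions that are valid LaTeX maths; any other efficient approach in Python is also welcome.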
Answer 0: (score: 2)
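You need to parse the string up to the requested position and, if that position is inside a valid pair of LaTeX environment delimiters, up to the closing delimiter as well, before you can answer True or False. That is because you have to process every relevant metacharacter (backslashes, dollars and parentheses) to determine its effect.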
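To illustrate why every metacharacter matters, here is a deliberately naive check that only counts dollar signs before the position and ignores backslashes. The name naive_is_maths is made up for this illustration; it is a counter-example, not a proposed solution.

```python
def naive_is_maths(s, pos):
    """Naive check (illustration only): an odd number of '$' before pos."""
    return s[:pos].count('$') % 2 == 1


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."

# The escaped \$ at index 15 is miscounted as an opener.
print(naive_is_maths(cost, 16))  # True, although position 16 is plain text
# The real opener at index 36 is then miscounted as a closer.
print(naive_is_maths(cost, 37))  # False, although position 37 is inside $x^2$
```

Both answers are wrong because the escaped dollar shifts the parity for everything that follows it.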
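As I understand it, LaTeX's $...$ and \(...\) environment delimiters can't be nested, so you don't have to worry about nested statements here; you only have to find the nearest complete $...$ or \(...\) pair.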
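You can't just match the literal $, \( or \) characters, however, because each of them can be preceded by an arbitrary number of \ backslashes. Instead, tokenize the input string on backslashes, dollars and parentheses, then iterate over the tokens in order and keep track of what was matched last to determine each token's effect (escaping the next character, or opening or closing a maths environment).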
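As a small sketch of the tokenizing step only, the snippet below lists every metacharacter in the question's string together with its index. The names TOKENS and tokens_with_positions are chosen for this illustration; the escape tracking described above is not included here.

```python
import re

# Character class for the four metacharacters: backslash, dollar, ( and ).
TOKENS = re.compile(r'[\\$()]')


def tokens_with_positions(s):
    """Return a list of (token, index) pairs for every metacharacter in s."""
    return [(m.group(0), m.start()) for m in TOKENS.finditer(s)]


cost = r"a) This costs \$1 but price goes as $x^2$ for \(x\) item(s)."
for token, index in tokens_with_positions(cost):
    print(index, repr(token))
```

A backslash immediately followed by another token (index 14 then 15, or 46 then 47) is what marks that following token as escaped.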
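There is no need to continue parsing once you are past the requested position and outside a maths environment; at that point you already have your answer and can return False early.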
Here is my implementation of such a parser:
import re

_maths_pairs = {
    # keys are opening characters, values matching closing characters
    # each is a tuple of char (string), escaped (boolean)
    ('$', False): ('$', False),
    ('(', True): (')', True),
}
_tokens = re.compile(r'[\\$()]')


def _tokenize(s):
    """Generator that produces token, pos, prev_pos tuples for s

    * token is a single character: a backslash, dollar or parenthesis
    * pos is the index into s for that token
    * prev_pos is the position of the preceding token, or -1 if there
      was no preceding token

    """
    prev_pos = -1
    for match in _tokens.finditer(s):
        token, pos = match[0], match.start()
        yield token, pos, prev_pos
        prev_pos = pos


def is_maths(s, pos):
    """Determines if pos in s is within a LaTeX maths environment"""
    expected_closer = None  # (char, escaped) if within $...$ or \(...\)
    opener_pos = None  # position of last opener character
    escaped = False  # True if the most recent token was an escaping backslash

    for token, token_pos, prev_pos in _tokenize(s):
        if expected_closer is None and token_pos > pos:
            # we are past the desired position, it'll never be within a
            # maths environment.
            return False

        # if there was more text between the current token and the last
        # backslash, then that backslash applied to something else.
        if escaped and token_pos > prev_pos + 1:
            escaped = False

        if token == '\\':
            # toggle the escaped flag; doubled escapes negate
            escaped = not escaped
            continue

        if (token, escaped) == expected_closer:
            if opener_pos < pos < token_pos:
                # position is after the opener, before the closer
                # so within a maths environment.
                return True
            expected_closer = None
        elif expected_closer is None and (token, escaped) in _maths_pairs:
            expected_closer = _maths_pairs[(token, escaped)]
            opener_pos = token_pos

        # a non-backslash token uses up any pending escape, so that
        # e.g. the second dollar in \$$ is not treated as escaped
        escaped = False

    return False
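Demo:
>>> cost = r'a) This costs \$1 but price goes as $x^2$ for \(x\) item(s).'
>>> is_maths(cost, 0) # should be False
False
>>> is_maths(cost, 16) # should be False, preceding $ is escaped
False
>>> is_maths(cost, 37) # should be True, within $...$
True
>>> is_maths(cost, 48) # should be True, within \(...\)
True
>>> is_maths(cost, 57) # should be False, within unescaped (...)
False
And additional tests showing that escaping is handled correctly:
>>> is_maths(r'Doubled escapes negate: \\$x^2$', 27) # should be True
True
>>> is_maths(r'Doubled escapes negate: \\(x\\)', 27) # no longer escaped, so False
False
My implementation happily ignores malformed LaTeX: unescaped $ characters inside \(...\), escaped \( and \) characters inside $...$, further \( sequences inside \(...\), and \) closers with no matching \( opener before them are all skipped. This ensures that the function keeps working even when given input that LaTeX itself would not render. The parser could be altered to raise an exception or return False in those cases, however. You would then need to add a global set created from _maths_pairs.keys() | _maths_pairs.values(), test (token, escaped) against that set whenever expected_closer is not None and (token, escaped) != expected_closer is true (to detect nested environment delimiters), and test for an escaped ) while expected_closer is None (to detect a \) closer without an opener).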