Compiling user input

Question

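I am using Python 3. In my application, the user can type a regular expression string directly, and the application uses it to match some strings. For example, the user can type \t+. However, I can't get it to work, because I can't manage to turn the input into a proper regular expression. Here is the code I have tried.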

>>> import re
>>> re.compile(re.escape("\t+")).findall("  ")
[]

However, when I change the regular expression string to \t, it works.

>>> re.compile(re.escape("\t")).findall("   ")
['\t']

Note that the argument to findall is a tab character. I don't know why it doesn't display correctly on Stack Overflow.

Can anyone point me in the right direction to solve this? Thanks.

Answer 1

Compiling user input

Wherever the user input comes from in your system, I assume it arrives as a string:

user_input = input("Input regex:")  # check console, it is expecting your input
print("User typed: '{}'. Input type: {}.".format(user_input, type(user_input)))

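This means you need to convert it into a regular expression, which is what re.compile is for. If you call re.compile with a str that is not a valid regular expression, it raises re.error.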

So you can write a function that checks whether the input is valid. You used re.escape, so I added a flag to the function that turns re.escape on or off.

def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape: re.compile(re.escape(regex_from_user))
        else: re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid

print("If you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

If the user input is \t+, you get:

>> If you don't use re.escape, the input is valid: True.
>> If you do use re.escape, the input is valid: True.

However, if the user input is [\t+, you get:

>> If you don't use re.escape, the input is valid: False.
>> If you do use re.escape, the input is valid: True.

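Note that [\t+ really is an invalid regular expression, yet with re.escape it becomes valid. That is because re.escape escapes every special character so that it is treated as a literal character. So if the input is \t+ and you apply re.escape, you end up searching for the character sequence \, t, + rather than for one or more tab characters.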
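To make the difference concrete, here is a minimal sketch comparing the two ways of compiling the same input. The variable names are made up for illustration, and the raw string r"\t+" stands in for what input() would return if the user typed \t+ at the prompt.

```python
import re

# What the user typed at the prompt: three characters (backslash, t, plus).
user_input = r"\t+"

raw_pattern = re.compile(user_input)                 # regex: one or more tabs
escaped_pattern = re.compile(re.escape(user_input))  # literal text: \t+

tabs = "\t\t"            # two real tab characters
literal_text = r"a\t+b"  # contains the characters \, t, + in sequence

raw_tabs = raw_pattern.findall(tabs)
escaped_tabs = escaped_pattern.findall(tabs)
escaped_literal = escaped_pattern.findall(literal_text)

print(repr(raw_tabs))         # the run of tabs is matched
print(repr(escaped_tabs))     # nothing: the escaped pattern is not looking for tabs
print(repr(escaped_literal))  # the literal backslash-t-plus sequence is matched
```

Only the unescaped pattern finds the tabs; the escaped one only finds the typed characters themselves.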

Checking the string you search in

Next, provide the string to search in. For example, here is a string in which the character between the double quotes should be a tab:

string_to_look_in = 'This is a string with a "\t" tab character.'  # \t guarantees a real tab

You can use the repr function to check for the tab manually.

print(string_to_look_in)
print(repr(string_to_look_in))

>> This is a string with a "    " tab character.
>> 'This is a string with a "\t" tab character.'

Note that repr shows the tab character in its \t representation.

Test script

Here is a script you can use to try all of this out:

import re

string_to_look_in = 'This is a string with a "\t" tab character.'  # \t guarantees a real tab
print("String to look into:", string_to_look_in)
print("String to look into:", repr(string_to_look_in), "\n")

user_input = input("Input regex:")  # check console, it is expecting your input

print("\nUser typed: '{}'. Input type: {}.".format(user_input, type(user_input)))


def is_valid_regex(regex_from_user: str, escape: bool) -> bool:
    try:
        if escape: re.compile(re.escape(regex_from_user))
        else: re.compile(regex_from_user)
        is_valid = True
    except re.error:
        is_valid = False
    return is_valid

print("\nIf you don't use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=False)))
print("If you do use re.escape, the input is valid: {}.".format(is_valid_regex(user_input, escape=True)))

if is_valid_regex(user_input, escape=False):
    regex = re.compile(user_input)
    print("\nRegex compiled as '{}' with type {}.".format(repr(regex), type(regex)))

    matches = regex.findall(string_to_look_in)
    print('Matches found:', matches)

else:
    print('\nThe regex was not valid, so no matches.')
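For use inside an application rather than at the console, the same idea can be folded into one helper. This is only a sketch under my own assumptions: the name find_with_user_regex and the choice to return None for an invalid pattern are illustrative, not part of the script above.

```python
import re
from typing import List, Optional


def find_with_user_regex(user_regex: str, text: str) -> Optional[List[str]]:
    """Compile user_regex as-is (no re.escape) and return all matches in text.

    Returns None when the pattern is not a valid regular expression.
    """
    try:
        pattern = re.compile(user_regex)
    except re.error:
        return None
    return pattern.findall(text)


sample = 'This is a string with a "\t" tab character.'
valid_result = find_with_user_regex(r"\t+", sample)     # user typed \t+
invalid_result = find_with_user_regex(r"[\t+", sample)  # user typed [\t+

print(repr(valid_result))
print(repr(invalid_result))
```

Compiling once inside the try block avoids validating and then compiling the same pattern a second time.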

Answer 2

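The result of re.escape("\t+") is '\\\t\\+'. Note that the + sign is now escaped with a backslash and is no longer a special character. The pattern no longer means "one or more tabs".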
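A short check of this, as a sketch: the escaped pattern only matches a tab that is immediately followed by a literal plus sign, which is why searching a string of tabs finds nothing. The variable names are illustrative.

```python
import re

escaped = re.escape("\t+")  # Python literal: a real tab followed by '+'
pattern = re.compile(escaped)

tab_then_plus = pattern.findall("\t+")  # tab followed by a literal plus sign
only_tabs = pattern.findall("\t\t\t")   # tabs with no plus sign

print(repr(escaped))
print(repr(tab_then_plus))
print(repr(only_tabs))
```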

Answer 3

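The literal text \t+ coming from an external source is not the same as the string literal "\t+". What does print("\t+") output? What about print(r"\t+")? The latter is equivalent to accepting that literal text as input and using it as a regex. The former is not. In this particular case, though, the difference does not matter, because a literal tab character behaves exactly like \t in a regular expression. Consider the following examples from an IPython session: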

In [24]: re.compile('\t+').findall('^I')
Out[24]: ['\t']

In [25]: re.compile('\t+').findall("\t")
Out[25]: ['\t']

In [26]: re.compile(r'\t+').findall('^I')
Out[26]: ['\t']

In [27]: re.compile(r'\t+').findall("\t")
Out[27]: ['\t']

In [28]: re.compile(r'\t+').findall(r"\t")
Out[28]: []

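I can only conclude that the first example, the one that did not produce the expected output, does not have a literal tab inside the quoted string.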
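The distinction between the two spellings can also be checked outside an interactive session. In this sketch the names cooked and raw are my own; raw corresponds to what input() would hand back for the typed text \t+.

```python
import re

cooked = "\t+"  # 2 characters: a real tab and '+'
raw = r"\t+"    # 3 characters: backslash, 't', '+'

lengths = (len(cooked), len(raw))

# Both spellings compile to a pattern meaning "one or more tabs".
cooked_matches = re.findall(cooked, "a\t\tb")
raw_matches = re.findall(raw, "a\t\tb")

print(lengths)
print(repr(cooked_matches), repr(raw_matches))
```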

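Furthermore, re.escape() is not appropriate for this case. Its purpose is to make sure a string from an untrusted source is treated literally rather than as a regular expression, so that it can safely be used as literal text to match.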
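For contrast, here is a sketch of a situation where re.escape is the right tool: the user supplies plain text to search for, not a pattern. The helper name count_literal is hypothetical.

```python
import re


def count_literal(needle: str, haystack: str) -> int:
    """Count occurrences of needle taken literally, not as a regex."""
    return len(re.findall(re.escape(needle), haystack))


text = "1+1=2, 11, 1+1"
literal_hits = count_literal("1+1", text)  # looks for the characters 1, +, 1
regex_hits = re.findall("1+1", text)       # '+' acts as a quantifier here

print(literal_hits)
print(regex_hits)
```

Without escaping, the + is read as "one or more", so the pattern matches 11 instead of the text the user meant.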

Python: using user input as a regular expression, how to do it correctly?
