Question

I'm trying to parse a file in which quotation marks are used to enclose strings. For example, the file might contain a line like this:

    "\"Hello there, my friends,\" the tour guide says." me @ swap notify

But it might also contain a line like this:

    "I'm a dingus who wants to put a backslash at the end of my statements. \\" me @ swap notify

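In that example the closing quote should not be treated as escaped, but a single backslash should be kept.

Is there any function I can use to extract the complete quoted statement? \n newlines and \r\n carriage returns also show up occasionally, so I'd like to translate those too, but only after I've isolated the complete string.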

Answer 1

Parse out the string portion; you can use a regular expression or string partition.
Then ast.literal_eval the string and assign it to a variable.

Test:

    >>> import re
    >>> import ast
    >>> with open('test.txt') as f:
    ...     for line in f:
    ...         m = re.match(r'(.*) \w+ @ \w+ \w+', line)
    ...         print(ast.literal_eval(m.group(1)))
    ...
    "Hello there, my friends," the tour guide says.
    I'm a dingus who wants to put a backslash at the end of my statements. \

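The regular expression says "match anything and store it as group 1, up to a space, a word, a space, an @ sign, a space, a word, a space, and a final word". You then retrieve that group with the .group(1) syntax. Parentheses define a group; see the regex documentation.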
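The string-partition route mentioned at the top of this answer can be sketched as follows. This is only a minimal sketch: it assumes the trailing command (`me @ swap notify` here) never contains a double quote, and the helper name `extract_quoted` is hypothetical, not something from the original answer.

```python
import ast

def extract_quoted(line):
    """Return the decoded leading string literal of line, or None.

    Assumption: nothing after the closing quote contains a double quote.
    """
    # rpartition splits at the LAST double quote in the line.
    head, quote, _tail = line.rpartition('"')
    if not quote:
        return None  # the line has no double quote at all
    try:
        # head + quote is the literal including both surrounding quotes.
        value = ast.literal_eval(head + quote)
    except (SyntaxError, ValueError):
        return None
    # Only accept a plain string; anything else means the assumption failed.
    return value if isinstance(value, str) else None

line1 = r'"\"Hello there, my friends,\" the tour guide says." me @ swap notify'
line2 = r'''"I'm a dingus who wants to put a backslash at the end of my statements. \\" me @ swap notify'''

print(extract_quoted(line1))
print(extract_quoted(line2))
```

If the trailing part of a line can itself contain quotes, this shortcut picks the wrong closing quote, and a pattern that understands escapes is the safer choice.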

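Here is a version of the ast.literal_eval approach that tries to parse the string as greedily as possible: it evaluates the whole line, and on a syntax error it cuts the line at the reported position and retries, until it either finds a match or cannot match at all: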

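    import ast

    def match_line(line):
        """Return the longest leading Python literal in line, or None."""
        while line:
            print("Trying to match:", line)
            try:
                return ast.literal_eval(line)
            except SyntaxError as e:
                # e.offset is the 1-based column where parsing failed;
                # drop everything from there on and retry.
                if not e.offset or e.offset - 1 >= len(line):
                    break  # the line cannot be shortened, so give up
                line = line[:e.offset - 1]
            except ValueError:  # No way it would ever match
                break
        return None

    with open('test.txt') as f:
        for line in f:
            match = match_line(line.strip())
            print("Matched:", match)
            print()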

Answer 2

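You can use a regular expression, although regexes are generally not recommended for parsing: unless your input is fairly simple or follows strict rules, they are easy to get wrong. There may be a parsing module that handles this better (for example, the csv module copes well with quotes and escapes inside fields, if your data is CSV).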

    import re

    txt1 = r'"\"Hello there, my friends,\" the tour guide says." me @ swap notify.'
    # Triple-quoted raw string, so the apostrophe needs no special handling.
    txt2 = r'''"I'm a dingus who wants to put a backslash at the end of my statements. \\" me @ swap notify'''

    print(re.findall(r'"(?:[^"\\]|\\.)+"', txt1)[0])
    # "\"Hello there, my friends,\" the tour guide says."
    print(re.findall(r'"(?:[^"\\]|\\.)+"', txt2)[0])
    # "I'm a dingus who wants to put a backslash at the end of my statements. \\"

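Note that I use the r'xxxxx' raw-string syntax so the backslashes do not have to be escaped a second time for Python (they are already escaped for the regex).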
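To see what the raw-string prefix saves, the comparison below spells the same pattern with and without it. It is a small illustration only; the variable names are made up for this example.

```python
import re

# Raw string: backslashes reach the regex engine exactly as typed.
raw_pattern = r'"(?:[^"\\]|\\.)+"'

# Ordinary string: every backslash has to be doubled again for Python.
plain_pattern = '"(?:[^"\\\\]|\\\\.)+"'

# Both spellings produce the identical pattern text.
print(raw_pattern == plain_pattern)

sample = r'"a \"quoted\" word" rest'
print(re.findall(raw_pattern, sample))
```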

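The regular expression "(?:[^"\\]|\\.)+" means: match an opening quote, then one or more items that are each either a character that is neither a quote nor a backslash, or a backslash together with whatever character immediately follows it, then the closing quote. The group is non-capturing, (?:...), so that findall returns the whole match rather than only the group.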
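Putting the pieces together: the pattern isolates the quoted text with its escapes still intact, and a second step can then turn `\"`, `\\`, `\n` and `\r\n` into real characters, as the question asks. The sketch below uses `ast.literal_eval` for that second step, which assumes the file's escape rules match Python's string-literal rules; the question does not confirm that. The names `QUOTED` and `first_quoted` are hypothetical.

```python
import ast
import re

# Same idea as the pattern above, with * instead of + so an empty "" matches too.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

def first_quoted(line):
    """Return (raw, decoded) for the first quoted string in line, or None."""
    m = QUOTED.search(line)
    if m is None:
        return None
    raw = m.group(0)                 # still has its quotes and escapes
    decoded = ast.literal_eval(raw)  # assumption: Python-style escapes
    return raw, decoded

line = r'"Line one\r\nLine two, with a trailing backslash \\" me @ swap notify'
raw, decoded = first_quoted(line)
print(raw)
print(repr(decoded))
```

Escapes that Python does not recognize are kept verbatim and may emit a warning, and a malformed `\x` or `\u` escape makes `ast.literal_eval` raise, so a file format with its own escape rules would need a hand-written decoding step instead.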

Escaped quotes when isolating strings from input
