Parsing text to replace quotes and nested quotes

Time: 2013-04-28 10:38:20

Tags: python formatting markup tokenize typography

Using Python, I would like to "educate" the quotes in a plain-text input and turn them into ConTeXt syntax. Here is a (recursive) example:

Original text:

Using python, I would like "educate" quotes of 
a plain text input and turn them into the Context syntax. 
Here is a (recursive) example:

Output:

Using python, I would like \quotation{educate} quotes of 
a plain text input and turn them into the Context syntax. 
Here is a (recursive) example:

I would also like it to handle nested quotations:

Original text:

Original text: "Using python, I would like 'educate' quotes of 
a plain text input and turn them into the Context syntax. 
Here is a (recursive) example:"

Output:

Original text: \quotation {Using python, I would like \quotation{educate} quotes of 
a plain text input and turn them into the Context syntax. 
Here is a (recursive) example:}

Of course, I should also handle edge cases such as:

She said "It looks like we are back in the '90s"

The specification for ConTeXt quotations is here:

http://wiki.contextgarden.net/Nested_quotations#Nested_quotations_in_MkIV

What is the most sensible approach in this case? Thank you very much!

2 answers:

Answer 0: (score: 3)

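This works with nested quotes, but it does not handle the edge cases (for example, the apostrophe in '90s is treated as a quote):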

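def quote(string):
    """Replace paired quote characters with ConTeXt \\quotation{...} groups."""
    text = ''
    stack = []  # quote characters that are currently open
    for token in iter_tokens(string):
        if is_quote(token):
            if stack and stack[-1] == token:  # matches the innermost open quote: closing
                text += '}'
                stack.pop()
            else:  # opening
                text += '\\quotation{'
                stack.append(token)
        else:
            text += token
    return text

def iter_tokens(string):
    """Yield runs of plain text and single quote characters, in order."""
    # Iterative rather than recursive, so long inputs do not hit the
    # recursion limit, and no empty token is produced after a final quote.
    while string:
        i = find_quote(string)
        if i is None:
            yield string
            return
        if i > 0:
            yield string[:i]
        yield string[i]
        string = string[i+1:]

def find_quote(string):
    """Return the index of the first quote character, or None."""
    for i, char in enumerate(string):
        if is_quote(char):
            return i
    return None

def is_quote(char):
    # Compare against a tuple: '' in '\'"' would be True for an empty token.
    return char in ('\'', '"')

def main():
    with open('input.txt') as fh:
        quoted = quote(fh.read())
    print(quoted)

if __name__ == '__main__':
    main()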
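
The edge case from the question (the apostrophe in '90s) could be approached by looking at the characters on either side of each quote mark before deciding whether it opens a quotation, closes one, or is just an apostrophe. The following is only a minimal sketch and is not part of the original answer: the name `educate`, the `OPENERS` set, and the rules themselves are illustrative assumptions. The assumed rules are that an opening quote follows whitespace or an opening bracket, a closing quote is not followed by a letter or digit, and a single quote directly before a digit is an apostrophe.

```python
QUOTES = "'\""
# Characters that may directly precede an opening quote (an assumption).
OPENERS = "([{" + QUOTES


def educate(text):
    """Heuristically convert straight quotes to ConTeXt \\quotation{...}."""
    out = []
    stack = []  # quote characters that are currently open
    for i, ch in enumerate(text):
        if ch not in QUOTES:
            out.append(ch)
            continue
        # Treat the start and end of the text as whitespace.
        prev = text[i - 1] if i > 0 else ' '
        nxt = text[i + 1] if i + 1 < len(text) else ' '
        closes = (stack and stack[-1] == ch
                  and not prev.isspace() and not nxt.isalnum())
        opens = ((prev.isspace() or prev in OPENERS)
                 and not nxt.isspace()
                 and not (ch == "'" and nxt.isdigit()))
        if closes:
            out.append('}')
            stack.pop()
        elif opens:
            out.append('\\quotation{')
            stack.append(ch)
        else:
            # Apostrophe (don't, '90s) or a stray quote: keep it as-is.
            out.append(ch)
    return ''.join(out)


print(educate('She said "It looks like we are back in the \'90s"'))
# prints: She said \quotation{It looks like we are back in the '90s}
```

These heuristics are deliberately simple. For instance, a plural possessive such as dogs' inside a single-quoted passage would still be taken for a closing quote, so real text may need further rules.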

Answer 1: (score: 0)

If you are sure the original text has whitespace in the right places, you can simply use a regular expression:

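import re

# Opening: a single quote at the start of the text or after whitespace (but
# not before a decade such as '90s), or a double quote after whitespace.
# Closing: either quote character followed by whitespace or the end of the text.
regexp = re.compile('(?P<opening>(?:^|(?<=\\s))\'(?!\\d0s)|(?<=\\s)")|["\'](?=\\s|$)')

def repl(match):
    if match.group('opening'):
        return '\\quotation{'
    else:
        return '}'

# Sample input: the edge case from the question.
s = 'She said "It looks like we are back in the \'90s"'
result = re.sub(regexp, repl, s)
print(result)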
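
To see how this regular expression behaves on the examples from the question, it can be wrapped in a small function and run on a few inputs. This is only a sketch: the names `QUOTE_RE` and `educate_regex` are hypothetical, and the pattern is copied unchanged from the answer.

```python
import re

# Pattern copied unchanged from the answer above.
QUOTE_RE = re.compile(
    '(?P<opening>(?:^|(?<=\\s))\'(?!\\d0s)|(?<=\\s)")|["\'](?=\\s|$)'
)


def educate_regex(text):
    """Apply the answer's regular expression to a whole string."""
    def repl(match):
        # The named group only participates for opening quotes.
        return '\\quotation{' if match.group('opening') else '}'
    return QUOTE_RE.sub(repl, text)


samples = [
    'She said "It looks like we are back in the \'90s"',
    'Original text: "I would like \'educate\' quotes"',
    '"hi" there',
]
for sample in samples:
    print(educate_regex(sample))
```

The first two samples come out as expected. The third shows the cost of relying on whitespace: a double quote at the very start of the text has no whitespace before it, so it is left untouched while its partner is still turned into a closing brace.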