Using Python

Time: 2017-01-24 05:46:22

Tags: python regex python-3.x latex

tl;dr version

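I have paragraphs that may contain quotations (e.g. "blah blah", 'this one also', etc.). I need to replace them with LaTeX-style quotation marks (e.g. ``blah blah'', `this one also', etc.) using Python 3.0.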

Background

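I have a lot of plain text files (more than 100). I have to produce a single LaTeX document with content taken from these files after doing some light text processing on them. I am using Python 3.0 for this. I have been able to get everything else working (escape characters, sections, etc.), but I cannot get the quotation marks right.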

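I can find the pattern with a regular expression (as described here), but how do I replace it with a given pattern? I don't know how to use the re.sub() function in this case, because my string may contain multiple instances of quotations. There is this related question, but how do I implement it in Python?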

2 answers:

Answer 0: (score: 1)

Design considerations

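  1. I have only considered regular "double-quotes" and 'single-quotes'. There may be other quotation marks (see this question).
  2. LaTeX closing quotes are also single quotes. We don't want to capture a LaTeX double closing quote (e.g. ``LaTeX double-quote'') and mistake it for a single-quoted string (around nothing).
  3. Word contractions and the possessive 's contain single quotes (e.g. don't, John's). They are characterised by alphabetic characters on both sides of the quote.
  4. Regular nouns (plural possessive) have a single quote after the word (e.g. the actresses' roles).

Solution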

    import re
    
    def texify_single_quote(in_string):
        in_string = ' ' + in_string #Hack (see explanations)
        return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]
    
    def texify_double_quote(in_string):
        return re.sub(r'"(.*?)"', r"``\1''", in_string)
    

Testing

    with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
        for line in fd_in:

            # Test for commutativity: the order of the two conversions must not matter
            assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))

            line = texify_single_quote(line)
            line = texify_double_quote(line)
            fd_out.write(line)
    

Input file (test.txt):

    # 'single', 'single', "double"
    # 'single', "double", 'single'
    # "double", 'single', 'single'
    # "double", "double", 'single'
    # "double", 'single', "double"
    # I'm a 'single' person
    # I'm a "double" person?
    # Ownership for plural words; the peoples' 'rights'
    # John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
    # "A double-quoted phrase, with a 'single' quote inside"
    # 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
    # 'A single-quoted phrase with a regular noun such as actresses' roles'
    

Output (output.txt):

    # `single', `single', ``double''
    # `single', ``double'', `single'
    # ``double'', `single', `single'
    # ``double'', ``double'', `single'
    # ``double'', `single', ``double''
    # I'm a `single' person
    # I'm a ``double'' person?
    # Ownership for plural words; the peoples' `rights'
    # John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
    # ``A double-quoted phrase, with a `single' quote inside''
    # `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
    # `A single-quoted phrase with a regular noun such as actresses' roles'
    

(The lines are prefixed with comment markers (#) to stop the post from reformatting the output!)

Explanation

We will break down this regex pattern: (?<=\s)'(?!')(.*?)'

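• Summary: (?<=\s)'(?!') deals with the opening single quote, while (.*?) deals with what is inside the quotes.
• (?<=\s)' is a positive look-behind and only matches single quotes that have whitespace (\s) preceding them. This is important to prevent matching contracted words such as can't (considerations 3 and 4).
• '(?!') is a negative look-ahead and only matches single quotes that are not followed by another single quote (consideration 2).
• As mentioned in this answer, the pattern (.*?) captures what is between the quotation marks, while \1 contains the capture.
• The "hack" in_string = ' ' + in_string is there because the positive look-behind does not match a single quote at the very beginning of a line. Adding a space to every line (and then removing it on return with slicing, return re.sub(...)[1:]) solves this problem!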
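As an aside to the breakdown above, the leading-space hack can probably be avoided by swapping the positive look-behind for a negative one. This is only a sketch: the function name texify_single_quote_no_hack is made up for illustration and is not part of the solution above. (?<!\S) means "not preceded by a non-whitespace character", which holds both after whitespace and at the very start of the string.

```python
import re

def texify_single_quote_no_hack(in_string):
    # (?<!\S) succeeds at the start of the string or right after whitespace,
    # so no leading space has to be added and sliced off again.
    # (?!') still skips a quote that is followed by another quote.
    return re.sub(r"(?<!\S)'(?!')(.*?)'", r"`\1'", in_string)

print(texify_single_quote_no_hack("'single' and don't"))
# prints: `single' and don't
```

The contraction is left alone because its quote is preceded by a letter, exactly as with the original look-behind.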

Answer 1: (score: 1)

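Regular expressions are great for some tasks, but they are still limited (read this for more details). With a parser written for this task, errors seem easier to fix.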

I created a simple function for this task and added comments. If anything is still unclear, please ask.

Code (online version here):

the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''


def convert_quotes(txt, quote_type):
    # find all quotes
    quotes_pos = []
    idx = -1

    while True:
        idx = txt.find(quote_type, idx+1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace quote with ``
    new_txt = []
    last_pos = -1

    for i, pos in enumerate(quotes_pos):
        # ignore the odd quotes - we dont replace them
        if i % 2 == 1:
            continue
        new_txt += txt[last_pos+1:pos]
        new_txt += '``'
        last_pos = pos

    # append the last part of the string
    new_txt += txt[last_pos+1:]

    return ''.join(new_txt)

print(convert_quotes(convert_quotes(the_text, '\''), '"'))

This prints:

This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
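The function above only rewrites the opening quote of each pair, and it uses `` for both quote types. If you want the closing quotes converted as well, with ` ... ' for single quotes and `` ... '' for double quotes, a variant along the following lines might work. This is a sketch, not the code above: the names LATEX_QUOTES and convert_quote_pairs are made up for illustration, and like the original it assumes the quotes are balanced, so contractions such as don't will raise an error.

```python
# Hypothetical mapping: plain quote character -> (LaTeX opening, LaTeX closing)
LATEX_QUOTES = {'"': ("``", "''"), "'": ("`", "'")}

def convert_quote_pairs(txt, quote_type):
    opening, closing = LATEX_QUOTES[quote_type]
    parts = txt.split(quote_type)
    # n quote characters give n + 1 parts, so an even number of parts
    # means an odd (unbalanced) number of quotes
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_type)
    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        # quotes alternate: the 1st opens, the 2nd closes, and so on
        out.append(opening if i % 2 == 0 else closing)
        out.append(part)
    return ''.join(out)

sample = 'This is my "test" String which has \'two\' quotes'
# Convert single quotes first: the '' produced for double quotes would
# otherwise be picked up as single quotes by the second pass.
print(convert_quote_pairs(convert_quote_pairs(sample, "'"), '"'))
# prints: This is my ``test'' String which has `two' quotes
```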

Note: parsing nested quotes is ambiguous.

For example, in proper language the string "bob said: "alice said: hello"" is nested

BUT:

the string "bob said: hi" and "alice said: hello" is not nested.

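If this is your case, you may want to first convert the nested quotes into a different kind of quote, or use parentheses () to disambiguate the nested quotes.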
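If rewriting the input by hand is not an option, one possible heuristic is to decide from context whether a quote opens or closes. The sketch below assumes that an opening quote always appears at the start of the text or right after whitespace or an opening bracket, and that every other quote closes. The name texify_by_context is made up for illustration, and the assumption will not hold for every text.

```python
import re

def texify_by_context(text):
    # Opening quote: at the start of the text, or preceded by whitespace
    # or an opening bracket (assumption, see above).
    text = re.sub(r'(?<![^\s(\[{])"', "``", text)
    # Every remaining double quote is treated as a closing quote.
    return text.replace('"', "''")

print(texify_by_context('"bob said: "alice said: hello""'))
# prints: ``bob said: ``alice said: hello''''
print(texify_by_context('"bob said: hi" and "alice said: hello"'))
# prints: ``bob said: hi'' and ``alice said: hello''
```

With this rule the two example strings come out differently, so the nested and the non-nested readings are told apart without counting quotes.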