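tl;dr version
I have paragraphs that may contain quotations (e.g. "blah blah", "this one also", etc.). Now I have to replace them with LaTeX-style quotes (e.g. ``blah blah'', ``this one also'', etc.) with the help of Python 3.0.
Background
I have a lot of plain-text files (over a hundred). I have to build a single LaTeX document from their content after a little text processing on each file, and I am using Python 3.0 for this. Everything else (escaping characters, sections, etc.) works, but I cannot get the quotation marks right.
I can find the quotations with a regex (as described here), but how do I replace each match with the given pattern? I don't know how to use re.sub() in this case, because my string may contain multiple quoted passages. There is this question related to the problem, but how do I implement it in Python?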
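Answer 0 (score: 1)
1. I have only considered the regular "double-quotes" and 'single-quotes'. There may be other quotation marks (see this question).
2. A single quote directly followed by another single quote must not be taken as an opening single quote.
3. Contractions and the possessive 's contain single quotes (e.g. don't, John's). They are characterised by having a letter on each side of the quote.
4. Plural possessives carry a single quote after the word (e.g. the actresses' roles).

import re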
def texify_single_quote(in_string):
    # Hack: prepend a space so that a quote at the start of the line is
    # preceded by whitespace (see explanations); [1:] removes it again.
    in_string = ' ' + in_string
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]


def texify_double_quote(in_string):
    return re.sub(r'"(.*?)"', r"``\1''", in_string)


with open("test.txt", 'r') as fd_in, open("output.txt", 'w') as fd_out:
    for line in fd_in:
        # Test for commutativity: the order of the two passes must not matter
        assert texify_single_quote(texify_double_quote(line)) == texify_double_quote(texify_single_quote(line))
        line = texify_single_quote(line)
        line = texify_double_quote(line)
        fd_out.write(line)
Input file (test.txt):
# 'single', 'single', "double"
# 'single', "double", 'single'
# "double", 'single', 'single'
# "double", "double", 'single'
# "double", 'single', "double"
# I'm a 'single' person
# I'm a "double" person?
# Ownership for plural words; the peoples' 'rights'
# John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.
# "A double-quoted phrase, with a 'single' quote inside"
# 'A single-quoted phrase with a "double quote" inside, with contracted words such as "don't"'
# 'A single-quoted phrase with a regular noun such as actresses' roles'
Output (output.txt):
# `single', `single', ``double''
# `single', ``double'', `single'
# ``double'', `single', `single'
# ``double'', ``double'', `single'
# ``double'', `single', ``double''
# I'm a `single' person
# I'm a ``double'' person?
# Ownership for plural words; the peoples' `rights'
# John's dog barked `Woof!', and Fred's parents' `loving' cat ran away.
# ``A double-quoted phrase, with a `single' quote inside''
# `A single-quoted phrase with a ``double quote'' inside, with contracted words such as ``don't'''
# `A single-quoted phrase with a regular noun such as actresses' roles'
(Commenting out the lines stops the post's formatter from mangling the output!)
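As an aside on the line marked "Hack" in texify_single_quote: I believe the prepended space can be avoided with a negative look-behind, (?<!\S), which succeeds both after whitespace and at the very start of the string. The function name texify_single_quote_no_hack is made up for this sketch, and the samples are a few lines taken from test.txt; treat it as a possible variant rather than a tested replacement.

```python
import re


def texify_single_quote_no_hack(in_string):
    # (?<!\S) succeeds when the previous character is whitespace OR when
    # there is no previous character at all (start of the string).
    return re.sub(r"(?<!\S)'(?!')(.*?)'", r"`\1'", in_string)


def texify_single_quote(in_string):
    # Version from the answer, repeated here only for comparison.
    in_string = ' ' + in_string
    return re.sub(r"(?<=\s)'(?!')(.*?)'", r"`\1'", in_string)[1:]


samples = [
    "'single', 'single', \"double\"",
    "I'm a 'single' person",
    "Ownership for plural words; the peoples' 'rights'",
    "John's dog barked 'Woof!', and Fred's parents' 'loving' cat ran away.",
]
for s in samples:
    # Both versions should agree on every sample line.
    assert texify_single_quote_no_hack(s) == texify_single_quote(s)

print(texify_single_quote_no_hack("'single' and don't"))  # `single' and don't
```

Whether this reads better than the hack is a matter of taste; the behaviour on the sample lines is the same.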
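Let's break down the regex pattern (?<=\s)'(?!')(.*?)':

(?<=\s)'(?!') deals with the opening single quote, while (.*?) deals with what is inside the quotes.
(?<=\s)' is a positive look-behind: it only matches a single quote preceded by whitespace (\s). This is important to avoid matching contracted words such as can't (considerations 3, 4).
'(?!') is a negative look-ahead: it only matches a single quote that is not followed by another single quote (consideration 2).
(.*?) captures what lies between the quotes, and \1 in the replacement holds the capture.
The hack in_string = ' ' + in_string is needed because the positive look-behind does not match a single quote at the very start of a line. Prepending a space to every line (and slicing it off again on return, return re.sub(...)[1:]) solves that problem!

Answer 1 (score: 1)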
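Regular expressions are great for some tasks, but they are still limited (read this for details). Writing a parser for this task seems to make bugs easier to fix.
I wrote a simple function for this task and added comments. If anything is still unclear, please ask.
Code (online version here):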
the_text = '''
This is my \"test\" String
This is my \'test\' String
This is my 'test' String
This is my \"test\" String which has \"two\" quotes
This is my \'test\' String which has \'two\' quotes
This is my \'test\' String which has \"two\" quotes
This is my \"test\" String which has \'two\' quotes
'''
def convert_quotes(txt, quote_type):
    # find all quotes
    quotes_pos = []
    idx = -1
    while True:
        idx = txt.find(quote_type, idx + 1)
        if idx == -1:
            break
        quotes_pos.append(idx)

    if len(quotes_pos) % 2 == 1:
        raise ValueError('bad number of quotes of type %s' % quote_type)

    # replace each opening quote with ``
    new_txt = []
    last_pos = -1
    for i, pos in enumerate(quotes_pos):
        # ignore the odd (closing) quotes - we don't replace them
        if i % 2 == 1:
            continue
        new_txt.append(txt[last_pos + 1:pos])
        new_txt.append('``')
        last_pos = pos

    # append the last part of the string
    new_txt.append(txt[last_pos + 1:])
    return ''.join(new_txt)


print(convert_quotes(convert_quotes(the_text, '\''), '"'))
This prints:
This is my ``test" String
This is my ``test' String
This is my ``test' String
This is my ``test" String which has ``two" quotes
This is my ``test' String which has ``two' quotes
This is my ``test' String which has ``two" quotes
This is my ``test" String which has ``two' quotes
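As the output shows, only the opening quotes are rewritten; the closing " and ' are left as they are, and single quotes also open with ``. If proper LaTeX pairs are wanted (``...'' and `...'), one possible variant is sketched below. The names convert_quote_pairs and to_latex_quotes and the opening/closing parameters are made up for this sketch. It assumes the text contains no apostrophes (such as don't), since those would upset the quote count.

```python
def convert_quote_pairs(txt, quote_char, opening, closing):
    """Replace alternating occurrences of quote_char with opening/closing."""
    parts = txt.split(quote_char)
    # n quote characters split the text into n + 1 parts, so an even
    # number of parts means an odd (unbalanced) number of quotes.
    if len(parts) % 2 == 0:
        raise ValueError('bad number of quotes of type %s' % quote_char)
    out = [parts[0]]
    for i, part in enumerate(parts[1:]):
        # The quote before this part is an opening one when i is even,
        # a closing one when i is odd.
        out.append(opening if i % 2 == 0 else closing)
        out.append(part)
    return ''.join(out)


def to_latex_quotes(txt):
    # Single quotes go first: the double-quote pass emits '' which would
    # otherwise be counted as two single quotes.
    txt = convert_quote_pairs(txt, "'", "`", "'")
    return convert_quote_pairs(txt, '"', "``", "''")


print(to_latex_quotes('This is my "test" String which has \'two\' quotes'))
# This is my ``test'' String which has `two' quotes
```

The order of the two passes matters here, unlike in the function above, because the replacement for a closing double quote is itself made of single quotes.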
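Note: parsing nested quotes is ambiguous.
For example, the string "bob said: "alice said: hello"" is nested in a proper language,
BUT the string "bob said: hi" and "alice said: hello" is not nested.
If this is your case, you may want to first parse those nested quotes into different quotes, or use parentheses () to disambiguate the nested quotes.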
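To make the ambiguity concrete, here is a small sketch of what a scanner that simply pairs quotes in order (first with second, third with fourth, as convert_quotes effectively does) sees in the two strings. The helper name quoted_spans is made up for this illustration.

```python
def quoted_spans(txt, quote_char='"'):
    # Positions of every quote character, in order of appearance.
    positions = [i for i, ch in enumerate(txt) if ch == quote_char]
    # Pair them up sequentially: 1st with 2nd, 3rd with 4th, ...
    return [txt[a + 1:b] for a, b in zip(positions[0::2], positions[1::2])]


nested = '"bob said: "alice said: hello""'
flat = '"bob said: hi" and "alice said: hello"'

print(quoted_spans(nested))  # ['bob said: ', '']
print(quoted_spans(flat))    # ['bob said: hi', 'alice said: hello']
```

The flat string is read as intended, while the nested one is split into 'bob said: ' and an empty quotation; nothing in the characters themselves tells the scanner that the second quote was meant to open an inner quotation.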