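I'm new to Stack Overflow and hope someone can help me with the code below.
I'm trying to adapt a piece of code from the Python Cookbook by Ascher, Ravenscroft and Martelli. I want to use dictionary key:value pairs (all text is UTF-8) to replace every word in text that contains a "long s" (ſ) with the equivalent word spelled with a modern lowercase s. I can build the dictionary from an existing tab-delimited file without problems (I've used a simple example dictionary in the code for ease of editing), but I want to make all the changes in a single pass for speed and efficiency. I removed the map and escape parts of the code because I don't think the long s needs escaping (though I may be wrong!). The first part works fine, but the inner function one_xlat doesn't seem to do anything. In the end it doesn't return/print text, and there is no error message. I've run the code from the command line and in IDLE with the same result. I've run it with and without map and escape, and renamed the variables just to be sure, but I can't quite get it to work. Can anyone help? Apologies if I've missed something obvious, and many thanks.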
The original code from Ascher, Ravenscroft and Martelli:
import re

def multiple_replace(text, adict):
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlat(match):
        return adict[match.group(0)]
    return rx.sub(one_xlat, text)
My adapted version:
import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

def word_replace(text, adictCR):
    regex_dict = re.compile('|'.join(adictCR))
    print(regex_dict)
    def one_xlat(match):
        return adictCR[match.group(0)]
    return regex_dict.sub(one_xlat, text)
    print(text)

word_replace(text, adictCR)
Answer 0 (score: 0)
I would rewrite your code like this:
# -*- coding: utf-8 -*-
import re

adictCR = {"handſome":"handsome","ſeated":"seated","veſſels":"vessels","ſea-side":"sea-side","ſand":"sand","waſhed":"washed", "oſ":"of", "proſpect":"prospect"}
text = "The caſtle, which is very extenſive, contains a ſtrong building, formerly uſed by the late emperor as his principal treaſury, and a noble terrace, which commands an extensive proſpect oſ the town of Sallee, the ocean, and all the neighbouring country."

new_s = []
for g in (m.group(0) for m in re.finditer(r'\w+|\W+', text)):
    if g in adictCR:
        g = adictCR[g]
    new_s.append(g)
You can then use ''.join(new_s) to get the new string.
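The loop and the join above can be folded into one function. This is only a sketch: the name replace_words and the two-entry sample dictionary are made up for illustration, and it assumes Python 3, where \w matches non-ASCII letters such as ſ in str patterns.

```python
import re

def replace_words(text, mapping):
    """Replace each whole word found in mapping; leave everything else untouched."""
    # Split into alternating runs of word and non-word characters,
    # so joining the pieces reproduces the original spacing and punctuation.
    tokens = re.findall(r'\w+|\W+', text)
    # Look each token up in the dictionary, falling back to the token itself.
    return ''.join(mapping.get(tok, tok) for tok in tokens)

# Hypothetical sample dictionary, much smaller than the one in the question.
sample = {"proſpect": "prospect", "oſ": "of"}
print(replace_words("an extensive proſpect oſ the town", sample))
```

Words that are not keys of the dictionary pass through unchanged. Because the text is split on \W, a hyphenated key such as "ſea-side" is never seen as a single token, so it would not be replaced by this approach.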
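Note: the pattern '\w+|\W+' only works on non-ASCII text in recent versions of Python (3.1+), where \w is Unicode-aware for str patterns. You could use re.split(r'(\W)', str) instead, but I don't think that works with utf-8 in Python 2.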
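For comparison, the single-regex approach from the question can also be kept. The adapted code there calls word_replace(text, adictCR) without printing or storing what it returns, which would explain why nothing appears. The sketch below prints the result and adds three things that are my own assumptions about what is wanted, not part of the Cookbook recipe: \b word boundaries, longest-key-first ordering, and a guard for an empty dictionary. The sample dictionary is again made up.

```python
import re

def word_replace(text, mapping):
    """Replace whole-word occurrences of the keys of mapping in one pass."""
    if not mapping:
        # An empty alternation would match the empty string; nothing to do.
        return text
    # Longest keys first, so a longer key wins over a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape protects keys that contain regex metacharacters;
    # \b restricts matches to whole words (an added assumption).
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keys)) + r')\b')
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# Hypothetical sample dictionary, including a hyphenated key.
sample = {"proſpect": "prospect", "oſ": "of", "ſea-side": "sea-side"}
result = word_replace("a proſpect oſ the ſea-side", sample)
print(result)  # the return value has to be printed or stored explicitly
```

Unlike splitting on \W, this version can match hyphenated keys such as "ſea-side", because each key is matched as a whole.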