Say I have a string that looks like this:
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
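You'll notice a lot of places in the string where an ampersand is followed by a character (such as "&y" and "&c"). I need to replace these sequences with the appropriate value from a dictionary, like so: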
dict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
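What is the fastest way to do this? I could manually find all the ampersands and then loop through the dictionary to change them, but that seems slow. Doing a bunch of regex replaces seems slow as well (my actual code will have a dictionary of about 30-40 pairs).
Any suggestions are appreciated, thanks.
Edit
As has been pointed out in the comments on this question, my dictionary is defined before runtime and will never change during the application's life cycle. It is a list of ANSI escape sequences and will contain about 40 items. The average length of the strings I compare against is about 500 characters, but some will be up to 5000 characters (although these will be rare). I am also currently using Python 2.6.
Edit #2: I accepted Tor Valamo's answer as the correct one, since it not only gave a valid solution (although it wasn't the best solution), but also took all the other solutions into account and did a tremendous amount of work to compare them. That answer is one of the best, most helpful answers I have ever come across on StackOverflow. Kudos to you.
Answer 0 (score: 30)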
mydict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
for k, v in mydict.iteritems():
    mystr = mystr.replace(k, v)

print mystr
The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog
I took the liberty of comparing a few solutions:
mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])
# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print 'Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr))

# My solution
t = time()
for x in range(rep):
    for k, v in mydict.items():
        mystr.replace(k, v)
    #print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t

# Claudiu
def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

t = time()
for x in range(rep):
    multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t

# Andrew Y - variable dict
def mysubst(somestr, somedict):
    subs = somestr.split("&")
    return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t

# Andrew Y - fixed
def repl(s):
    return mydict["&"+s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t
Results in Python 2.6:
Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict      1.04699993134
Peter fixed & variable dict    0.218999862671
Claudio variable dict          2.48400020599
Claudio fixed dict             0.0940001010895
Andrew Y variable dict         0.0309998989105
Andrew Y fixed dict            0.0310001373291
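Both Claudiu's and Andrew's solutions kept coming out at 0, so I had to step it up to 10000 runs.
I also ran it in Python 3 (because of unicode), with replacements for the characters from 39 to 1024 (38 is the ampersand, so I didn't want to include it). The string length is up to 10,000 and includes about 980 replacements, with variable random inserts of length 0-20. The unicode values from 39 to 1024 produce characters of both 1 and 2 bytes in length, which could affect some solutions.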
mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])
# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and ' \
      'random inserts of lengths 0-20' % (rep, len(mystr)))

# Tor Valamo - too long
#t = time()
#for x in range(rep):
#    for k, v in mydict.items():
#        mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)

# Claudiu - too long
#def multiple_replace(dict, text):
#    # Create a regular expression from the dictionary keys
#    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
#    # For each match, look-up corresponding value in dictionary
#    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
#    multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)

# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())

# Andrew Y - variable dict
def mysubst(somestr, somedict):
    subs = somestr.split("&")
    return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

def mysubst2(somestr, somedict):
    subs = somestr.split("&")
    # note: subs[0] is used as the join separator here, so the result differs from mysubst
    return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)

t = time()
for x in range(rep):
    mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)

# Andrew Y - fixed
def repl(s):
    return mydict[s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)

# gnibbler
t = time()
for x in range(rep):
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    "".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)
Results:
Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict      0.0 # disqualified 329 secs
Peter fixed & variable dict    2.07799983025
Peter fixed dict               1.53100013733
Claudio variable dict          0.0 # disqualified, 37 secs
Claudio fixed dict             1.5
Andrew Y variable dict         0.578000068665
Andrew Y variable dict 2       0.56299996376
Andrew Y fixed dict            0.56200003624
gnibbler fixed & variable dict 0.530999898911
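(** Note that gnibbler's code uses a different dict, whose keys do not include the '&'. Andrew's code also uses this alternate dict, but it didn't make much of a difference, maybe just a 0.01x speedup.)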
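For re-running comparisons like the ones above, the standard-library timeit module is a common alternative to hand-rolled time() deltas. The following is only a minimal sketch, written for Python 3 rather than the 2.6 used in this thread. The sample text and the names codes, replace_loop, replace_regex and replace_split are illustrative assumptions, not the benchmark inputs or code from this answer. It checks that every candidate produces the same output before timing anything.

```python
import re
import timeit

# Hypothetical sample data modelled on the question (not the benchmark inputs above).
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&b": "\033[0;32m",
         "&Y": "\033[0;33m", "&u": "\033[0;34m"}
text = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog " * 10

def replace_loop(s):
    # one str.replace pass per dictionary entry
    for key, value in codes.items():
        s = s.replace(key, value)
    return s

pattern = re.compile("|".join(map(re.escape, codes)))

def replace_regex(s):
    # one precompiled alternation, one dictionary lookup per match
    return pattern.sub(lambda m: codes[m.group()], s)

def replace_split(s):
    # split on '&' and look up the first character of every later piece;
    # raises KeyError if an '&' is not followed by a known code
    head, *rest = s.split("&")
    return head + "".join(codes["&" + part[:1]] + part[1:] for part in rest)

candidates = [replace_loop, replace_regex, replace_split]

# Verify that all candidates agree before timing them.
expected = replace_loop(text)
assert all(f(text) == expected for f in candidates)

for f in candidates:
    # best of 3 repeats, 1000 calls each
    seconds = min(timeit.repeat(lambda: f(text), number=1000, repeat=3))
    print("%-15s %.4f" % (f.__name__, seconds))
```

Absolute numbers depend on the machine and interpreter, so only the relative ordering on your own typical inputs is meaningful.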
Answer 1 (score: 14)
Try this, which makes use of regular expression substitution and standard string formatting:
# using your stated values for str and dict:
>>> import re
>>> str = re.sub(r'(&[a-zA-Z])', r'%(\1)s', str)
>>> str % dict
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over the \x1b[0;34mlazy dog'
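The re.sub() call replaces every sequence of an ampersand followed by a single letter with the pattern %(..)s containing that same sequence.
The % formatting then takes advantage of a feature of string formatting that lets a dictionary specify the substitutions, rather than the more commonly seen positional arguments.
An alternative does this directly in re.sub, using a callback: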
>>> import re
>>> def dictsub(m):
...     return dict[m.group()]
...
>>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)
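This time I'm using a closure to reference the dictionary from inside the callback function. This approach can give you a little more flexibility. For example, if your strings contain unrecognized code sequences, you could use something like dict.get(m.group(), '??') to avoid raising exceptions.
(By the way, both "dict" and "str" are builtin functions, and you'll get into trouble if you use those names much in your own code. Just in case you didn't know that. They're fine for a question like this, of course.)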
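The dict.get fallback mentioned above can be sketched as follows. This is an illustrative Python 3 version, not the answer's own code: the names codes, pattern and colorize, and the unknown parameter, are assumptions introduced here, and the mapping is named codes to avoid shadowing the builtin dict.

```python
import re

# Mapping from the question, renamed so it does not shadow the builtin dict.
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&b": "\033[0;32m",
         "&Y": "\033[0;33m", "&u": "\033[0;34m"}
pattern = re.compile(r"&[a-zA-Z]")

def colorize(text, unknown=None):
    """Replace known &x codes; leave unknown ones as they are unless `unknown` is given."""
    def lookup(match):
        token = match.group()
        # dict.get never raises, so unrecognized sequences cannot cause a KeyError
        return codes.get(token, token if unknown is None else unknown)
    return pattern.sub(lookup, text)

print(repr(colorize("The &yquick &zbrown fox")))
# 'The \x1b[0;30mquick &zbrown fox'
print(repr(colorize("The &yquick &zbrown fox", unknown="??")))
# 'The \x1b[0;30mquick ??brown fox'
```

Passing the unmatched token through unchanged is usually the safer default, since ordinary text such as "R&D" then survives intact.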
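Edit: I decided to check Tor's test code, and concluded that it's nowhere near representative and is in fact buggy. The generated string doesn't even contain any ampersands (!). The revised code below generates a representative dictionary and string, similar to the OP's example inputs.
I also wanted to verify that each algorithm's output was the same. Below is a revised test program with only Tor's, mine, and Claudiu's code, because the others were breaking on the sample input. (I think they're all brittle unless the dictionary maps basically every possible ampersand sequence, which is what Tor's test code was doing.) This one properly seeds the random number generator so that each run is the same. Finally, I added a minor variation that uses a generator to avoid some function call overhead, for a small performance improvement.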
from time import time
import string
import random
import re
random.seed(1919096) # ensure consistent runs
# build dictionary with 40 mappings, representative of original question
mydict = dict(('&' + random.choice(string.letters), '\x1b[0;%sm' % (30+i)) for i in range(40))
# build simulated input, with mix of text, spaces, ampersands in reasonable proportions
letters = string.letters + ' ' * 12 + '&' * 6
mystr = ''.join(random.choice(letters) for i in range(1000))
# How many times to run each solution
rep = 10000
print('Running %d times with string length %d and %d ampersands'
      % (rep, len(mystr), mystr.count('&')))

# Tor Valamo
# fixed from Tor's test, so it actually builds up the final string properly
t = time()
for x in range(rep):
    output = mystr
    for k, v in mydict.items():
        output = output.replace(k, v)
print('%-30s' % 'Tor fixed & variable dict', time() - t)

# capture "known good" output as expected, to verify others
expected = output

# Peter Hansen

# build charset to use in regex for safe dict lookup
charset = ''.join(x[1] for x in mydict.keys())

# grab reference to method on regex, for speed
patsub = re.compile(r'(&[%s])' % charset).sub

t = time()
for x in range(rep):
    output = patsub(r'%(\1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
assert output == expected

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time() - t)
assert output == expected

# Peter 3 - freaky generator version, to avoid function call overhead
def dictsub(d):
    m = yield None
    while 1:
        m = yield d[m.group()]

dictsub = dictsub(mydict).send
dictsub(None)   # "prime" it

t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter generator', time() - t)
assert output == expected

# Claudiu - Precompiled
regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub

t = time()
for x in range(rep):
    output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time() - t)
assert output == expected
I forgot to include the benchmark results before:
Running 10000 times with string length 1000 and 96 ampersands
('Tor fixed & variable dict ', 2.9890000820159912)
('Peter fixed & variable dict ', 2.6659998893737793)
('Peter fixed dict ', 1.0920000076293945)
('Peter generator ', 1.0460000038146973)
('Claudio fixed dict ', 1.562000036239624)
Also, snippets of the inputs and the correct output:
mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
mydict = {'&p': '\x1b[0;37m', '&q': '\x1b[0;66m', '&v': ...}
output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'
Compare that with what I saw from Tor's test code output:
mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
output = # same as mystr since there were no ampersands inside
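Answer 2 (score: 8)
If you really want to dig into the topic, take a look at this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
The obvious solution, iterating over the dictionary and replacing each element in the string, takes O(n*m) time, where n is the size of the dictionary and m is the length of the string.
The Aho-Corasick algorithm, by contrast, finds all entries of the dictionary in O(n+m+f), where f is the number of found elements.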
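To make the idea concrete, here is a compact Python 3 sketch of an Aho-Corasick style automaton used for replacement. It is a simplified illustration under stated assumptions, not a reference implementation: the function names build_automaton, find_all and replace_all are made up for this example, only the longest key ending at each position is reported (rather than every match), and overlapping matches are resolved by keeping the one that finishes first.

```python
from collections import deque

def build_automaton(keys):
    """Build trie transitions, failure links and outputs for the given keys."""
    goto = [{}]     # goto[state][char] -> next state
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    out = [None]    # out[state] -> longest key ending at this state, if any
    for key in keys:
        state = 0
        for ch in key:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(None)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] = key
    queue = deque(goto[0].values())   # depth-1 states fall back to the root
    while queue:                      # breadth-first, so shallower links are ready
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            if out[nxt] is None:
                out[nxt] = out[fail[nxt]]   # inherit a shorter key ending here
    return goto, fail, out

def find_all(text, automaton):
    """Yield (start, key) for the longest key ending at each position, in one pass."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state] is not None:
            yield i - len(out[state]) + 1, out[state]

def replace_all(text, mapping):
    """Replace non-overlapping occurrences of the mapping's keys."""
    pieces, pos = [], 0
    for start, key in find_all(text, build_automaton(mapping)):
        if start >= pos:    # skip matches overlapping an earlier replacement
            pieces.append(text[pos:start])
            pieces.append(mapping[key])
            pos = start + len(key)
    pieces.append(text[pos:])
    return "".join(pieces)

codes = {"&y": "\033[0;30m", "&c": "\033[0;31m"}
print(repr(replace_all("The &yquick &cbrown fox", codes)))
```

For the two-character keys in the question this is considerably more machinery than needed; the automaton pays off mainly when keys are long, numerous, or share prefixes.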
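Answer 3 (score: 6)
If the number of keys in the list is large, and the number of occurrences in the string is low (and mostly zero), then you could iterate over the occurrences of the ampersand in the string and use the dictionary keyed by the first character of each substring. I don't code in Python often, so the style might be a bit off, but here is my take on it: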
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
dict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
def rep(s):
    return dict["&"+s[0:1]] + s[1:]

subs = str.split("&")
res = subs[0] + "".join(map(rep, subs[1:]))
print res
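Of course, there is the question of what happens when an ampersand comes from the string itself: you would need to escape it in some way before feeding it through this process, and then unescape it afterwards.
And, as is usual with performance issues, timing the various approaches on your typical (and also worst-case) data set and comparing them is a good thing to do.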
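One way to deal with literal ampersands, sketched here in Python 3, is to fold the escaping into the same pass instead of escaping and unescaping around it. The convention that "&&" stands for a single literal ampersand is an assumption made for this example, not something the question specifies; the names codes, token and substitute are likewise illustrative, and a regex replaces the split-based approach of this answer.

```python
import re

codes = {"&y": "\033[0;30m", "&c": "\033[0;31m"}

# Assumption: "&&" in the input stands for one literal ampersand.
token = re.compile(r"&(&|[a-zA-Z])")

def substitute(text):
    def lookup(match):
        if match.group(1) == "&":
            return "&"                                   # escaped ampersand
        return codes.get(match.group(), match.group())  # unknown codes pass through
    return token.sub(lookup, text)

print(repr(substitute("R&&D &yyes &zno")))
# 'R&D \x1b[0;30myes &zno'
```

Because the regex consumes "&&" as one token, the text "&&y" comes out as the literal "&y" rather than as a colour code.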
EDIT: placed it into a separate function to work with an arbitrary dictionary:
def mysubst(somestr, somedict):
    subs = somestr.split("&")
    return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
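EDIT2: got rid of an unneeded string concatenation; over many iterations it still seems a bit faster than the previous version.
def mysubst(somestr, somedict):
    subs = somestr.split("&")
    # subs[0] has to be the first joined element, not the separator passed to join
    return "".join([subs[0]] + [somedict["&" + arg[0:1]] + arg[1:] for arg in subs[1:]])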
Answer 4 (score: 4)
Here is the C extension approach for Python:
const char *dvals[]={
// 0-64
"","","","","","","","","","",
"","","","","","","","","","",
"","","","","","","","","","",
"","","","","","","","","","",
"","","","","","","","","","",
"","","","","","","","","","",
"","","","","",
//A-Z
"","","","","",
"","","","","",
"","","","","",
"","","","","",
"","","","","33",
"",
//
"","","","","","",
//a-z
"","32","31","","",
"","","","","",
"","","","","",
"","","","","",
"34","","","","30",
""
};
int dsub(char*d,char*s){
    char *ofs=d;
    do{
        /* s[1]>0 keeps the table index non-negative where char is signed */
        if(*s=='&' && s[1]>0 && s[1]<='z' && *dvals[s[1]]){
            //\033[0; (written as the literal characters \ 0 3 3, not the ESC byte)
            *d++='\\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';';
            //consider as fixed 2 digits
            *d++=dvals[s[1]][0];
            *d++=dvals[s[1]][1];
            *d++='m';
            s++; //skip
        //non &,invalid, unused (&) ampersand sequences will go here.
        }else *d++=*s;
    }while(*s++);
    return d-ofs-1;
}
The Python code I tested it with:
from mylib import *
import time
start=time.time()
instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.\n"*100000
x=dsub(instr)
end=time.time()
print "time taken",end-start,",input str length",len(x)
print "first few lines"
print x[:1100]
Results:
time taken 0.140000104904 ,input str length 11000000
first few lines
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
The \033[0;30mquick \033[0;31mbrown \033[0;32mfox \033[0;33mjumps over the \033[0;34mlazy dog, skip &Unknown.
It is supposed to run in O(n), and it took only 160 ms (average) for an 11 MB string on my Mobile Celeron 1.6 GHz PC.
It also skips unknown sequences as they are; for example, &Unknown is returned unchanged.
Let me know if you have any problems with compiling, bugs, etc.
Answer 5 (score: 3)
This seems to do what you want: multiple string replacements at once using RegExps. Here is the relevant code:
import re

def multiple_replace(dict, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

print multiple_replace(dict, str)
Answer 6 (score: 3)
A general solution for defining replacement rules is to use regex substitution with a function that provides the mapping (see re.sub()).
import re
str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
dict = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
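def programmaticReplacement( match ):
    return dict[ match.group( 1 ) ]

# raw string, so the pattern contains no invalid "\&" escape
colorstring = re.sub( r'(&.)', programmaticReplacement, str )
This is particularly nice for non-trivial substitutions (e.g. anything requiring mathematical operations to create the substitute).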
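As an illustration of a substitution that needs arithmetic, the Python 3 sketch below computes the escape sequence instead of looking it up. The "&N" digit markup is purely hypothetical (the question uses letters), and color_from_digit and colorize_digits are names invented for this example.

```python
import re

def color_from_digit(match):
    # Hypothetical rule: "&N" with N in 0-7 selects ANSI foreground colour 30+N.
    return "\033[0;%dm" % (30 + int(match.group(1)))

def colorize_digits(text):
    # Only digits 0-7 are matched, so "&9" is left untouched.
    return re.sub(r"&([0-7])", color_from_digit, text)

print(repr(colorize_digits("&1red &4blue &9unchanged")))
# '\x1b[0;31mred \x1b[0;34mblue &9unchanged'
```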
Answer 7 (score: 3)
Here is a version using split/join:
mydict = {"y":"\033[0;30m",
"c":"\033[0;31m",
"b":"\033[0;32m",
"Y":"\033[0;33m",
"u":"\033[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
myparts = mystr.split("&")
myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
print "".join(myparts)
If there are ampersands with invalid codes, you can use this to preserve them:
myparts[1:]=[mydict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]
Peter Hansen pointed out that this fails when there is a double ampersand. In that case, use this version:
mystr = "The &yquick &cbrown &bfox &Yjumps over the &&ulazy dog"
myparts = mystr.split("&")
myparts[1:]=[mydict.get(x[:1],"&"+x[:1])+x[1:] for x in myparts[1:]]
print "".join(myparts)
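Answer 8 (score: 1)
I'm not sure about the speed of this solution either, but you could just loop through your dictionary and repeatedly call the built-in
str.replace(old, new)
This might perform decently well if the original string isn't too long, but it would obviously suffer as the string gets longer.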
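A minimal Python 3 sketch of that loop follows; replace_each and codes are names chosen for this example. It also shows one caveat worth knowing: every pass sees the output of the previous passes, so a replacement value that contains another key gets rewritten again. The second call relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on.

```python
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&b": "\033[0;32m"}

def replace_each(text, mapping):
    # One full scan of the string per key; each pass builds a new string.
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

print(repr(replace_each("The &yquick &cbrown fox", codes)))
# 'The \x1b[0;30mquick \x1b[0;31mbrown fox'

# Caveat: "a" becomes "b" in the first pass, and that "b" becomes "c" in the second.
print(replace_each("a", {"a": "b", "b": "c"}))
# c
```

With the question's data this caveat does not bite, because none of the ANSI sequences contains an ampersand.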
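Answer 9 (score: 1)
The problem with doing this mass replacement in Python is the immutability of strings: every time you replace one item in the string, an entire new string is reallocated from the heap, again and again.
So if you want the fastest solution, you either need to use a mutable container (e.g. a list), or write this machinery in plain C (or better, in Pyrex or Cython). In any case, I'd suggest writing a simple parser based on a simple finite-state machine and feeding it the symbols of your string one by one.
The suggested solutions based on regexps work in a similar way, because regexps use an FSM behind the scenes.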
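The finite-state idea can be sketched in Python 3 as a two-state scanner that appends to a list and joins once at the end. This is an illustrative sketch, not a tuned implementation: fsm_replace and codes are names invented here, and it assumes every key is "&" followed by exactly one character. In pure Python the per-character loop is likely to be slower than the C-implemented str and re methods; the structure is what would carry over to C or Cython.

```python
codes = {"&y": "\033[0;30m", "&c": "\033[0;31m", "&u": "\033[0;34m"}

def fsm_replace(text, mapping):
    out = []           # mutable container; joined once at the end
    pending = False    # state: True right after an unconsumed '&'
    for ch in text:
        if pending:
            pending = False
            key = "&" + ch
            if key in mapping:
                out.append(mapping[key])
                continue
            out.append("&")      # not a known code: emit the held '&'
            if ch == "&":        # this '&' may itself start a code
                pending = True
                continue
            out.append(ch)
        elif ch == "&":
            pending = True
        else:
            out.append(ch)
    if pending:
        out.append("&")          # trailing '&' at end of input
    return "".join(out)

print(repr(fsm_replace("The &yquick &zfox & &&ulazy&", codes)))
# 'The \x1b[0;30mquick &zfox & &\x1b[0;34mlazy&'
```

Unknown sequences, lone ampersands and doubled ampersands all pass through without raising, which matches the behaviour of the more defensive split/join variant elsewhere in this thread.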
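Answer 10 (score: 1)
Since someone mentioned using a simple parser, I thought I'd cook one up using pyparsing. With pyparsing's transformString method, pyparsing internally scans through the source string and builds a list of the matching text and the intervening text. When all is done, transformString then ''.join's this list, so there is no performance problem from building up strings by increments. (The parse action defined for ANSIreplacer does the conversion from the matched &_ characters to the desired escape sequence, and replaces the matched text with the output of the parse action. Since only matching sequences will satisfy the parser expression, there is no need for the parse action to handle undefined &_ sequences.)
The FollowedBy('&') is not strictly necessary, but it shortcuts the parsing process by verifying that the parser is actually positioned at an ampersand before doing the more expensive checking of all of the markup options.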
from pyparsing import FollowedBy, oneOf
escLookup = {"&y":"\033[0;30m",
"&c":"\033[0;31m",
"&b":"\033[0;32m",
"&Y":"\033[0;33m",
"&u":"\033[0;34m"}
# make a single expression that will look for a leading '&', then try to
# match each of the escape expressions
ANSIreplacer = FollowedBy('&') + oneOf(escLookup.keys())
# add a parse action that will replace the matched text with the
# corresponding ANSI sequence
ANSIreplacer.setParseAction(lambda toks: escLookup[toks[0]])
# now use the replacer to transform the test string; throw in some extra
# ampersands to show what happens with non-matching sequences
src = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog & &Zjumps back"
out = ANSIreplacer.transformString(src)
print repr(out)
Prints:
'The \x1b[0;30mquick \x1b[0;31mbrown \x1b[0;32mfox \x1b[0;33mjumps over
the \x1b[0;34mlazy dog & &Zjumps back'
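This will certainly not win any performance contests, but if your markup starts to get more complicated, then having a parser foundation will make it easier to extend.
Answer 11 (score: 0)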
>>> a=[]
>>> str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
>>> d={"&y":"\033[0;30m",
... "&c":"\033[0;31m",
... "&b":"\033[0;32m",
... "&Y":"\033[0;33m",
... "&u":"\033[0;34m"}
>>> for item in str.split():
...     if item[:2] in d:
...         a.append(d[item[:2]]+item[2:])
...     else: a.append(item)
...
>>> print ' '.join(a)
Answer 12 (score: 0)
Try this:
tr.replace("&y",dict["&y"])
tr.replace("&c",dict["&c"])
tr.replace("&b",dict["&b"])
tr.replace("&Y",dict["&Y"])
tr.replace("&u",dict["&u"])