我在固定宽度的文件中有几十万个不稳定的值。我想找到字符串old_values并用new_values中相应位置的字符串替换它们。我可以循环并一次完成这一项,但我几乎可以肯定有一种更快的方式,我不够专业知道。
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N') # and many more
new_values = (' -0', ' -1', ' -2', ' -3', ' -4', ' -5') # and many more
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data
循环遍历每个值并在每一行上运行.replace似乎很慢。例如:
for x in len(old_values):
line.replace(old_values[x], new_values[x])
有关加快速度的提示吗?
答案 0 :(得分:3)
以下是逐字符遍历数据的代码,如果找到映射则替换它。这假设每个需要替换的数据都是绝对唯一的。
def replacer(instring, mapping):
item = ''
for char in instring:
item += char
yield item[:-5]
item = item[-5:]
if item in mapping:
yield mapping[item]
item = ''
yield item
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = (' -0', ' -1', ' -2', ' -3', ' -4', ' -5')
value_map = dict(zip(old_values, new_values))
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data
result = ''.join(replacer(file_snippet, value_map))
print result
在您的示例数据中,这给出了:
0000000000000001 -0000000000000000000020000200000000000000000003 -10000100000000000000500000000000000000000000
如果数据符合这种方式,更快的方法是将数据拆分为5个字符的块:
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = (' -0', ' -1', ' -2', ' -3', ' -4', ' -5')
value_map = dict(zip(old_values, new_values))
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data
result = []
for chunk in [ file_snippet[i:i+5] for i in range(0, len(file_snippet), 5) ]:
if chunk in value_map:
result.append(value_map[chunk])
else:
result.append(chunk)
result = ''.join(result)
print result
这导致您的示例数据中没有替换,除非您删除前导零,然后得到:
000000000000001 -0000000000000000000020000200000000000000000003 -10000100000000000000500000000000000000000000
与上述相同。
答案 1 :(得分:2)
进行替换映射(dict
)可以加快速度:
import timeit
input_string = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = (' -0', ' -1', ' -2', ' -3', ' -4', ' -5')
mapping = dict(zip(old_values,new_values))
def test_replace_tuples(input_string, old_values, new_values):
for x in xrange(len(old_values)):
input_string = input_string.replace(old_values[x], new_values[x])
return input_string
def test_replace_mapping(input_string, mapping):
for k, v in mapping.iteritems():
input_string = input_string.replace(k, v)
return input_string
print timeit.Timer('test_replace_tuples(input_string, old_values, new_values)',
'from __main__ import test_replace_tuples, input_string, old_values, new_values').timeit(10000)
print timeit.Timer('test_replace_mapping(input_string, mapping)',
'from __main__ import test_replace_mapping, input_string, mapping').timeit(10000)
打印:
0.0547060966492
0.048122882843
请注意,不同输入的结果可能不同,请在实际数据上进行测试。