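I have lists of strings, some of which are integers. I am looking for a fast way to replace every number greater than 100 with a token built from the number's length (its character count):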
['foo', 'bar', '3333'] -> ['foo', 'bar', '99994']
I will run this millions of times on lists of roughly 100 items each. The pure-Python approach I came up with is:
def quash_large_numbers(tokens, threshold=100):
    def is_int(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    BIG_NUMBER_TOKEN = '9999%d'
    tokens_no_high_nums = [BIG_NUMBER_TOKEN % len(t) if is_int(t) and int(t) > threshold else t
                           for t in tokens]
    return tokens_no_high_nums
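I tried to see whether pandas could do this faster, but it is much slower on small lists, presumably because of the overhead of converting back and forth between a list and a Series.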
import pandas as pd

def pd_quash_large_numbers(tokens, threshold=100):
    # Must be a string so it can be concatenated with the string lengths below
    BIG_NUMBER_TOKEN = '9999'
    tokens_ser = pd.Series(tokens)
    # Non-numeric tokens become NaN, and NaN > threshold is False
    int_tokens = pd.to_numeric(tokens_ser, errors='coerce')
    tokens_over_threshold = int_tokens > threshold
    str_lengths = tokens_ser[tokens_over_threshold].str.len().astype(str)
    tokens_ser[tokens_over_threshold] = BIG_NUMBER_TOKEN + str_lengths
    return tokens_ser.tolist()
Am I missing a more efficient approach here? Possibly with Cython?
Answer 0 (score: 1)
New, faster answer
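import numpy as np

v = np.array(['foo', 'bar', '3333'])
r = np.arange(v.size)            # position of every token
m = np.char.isdigit(v)           # np.char is the public alias of np.core.defchararray
g = v[m].astype(int) > 100       # among digit-only tokens, which exceed 100
i = r[m][g]                      # positions of the tokens to replace
t = np.array(['9999{}'.format(len(x)) for x in v[i].tolist()])
# Cast to the replacement dtype so the new tokens fit. This assumes at least
# one match and no token longer than the replacements (it would be truncated).
v = v.astype(t.dtype)
v[i] = t
v.tolist()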
['foo', 'bar', '99994']
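The steps above can be wrapped into a function that also tolerates two edge cases: a list with no number over the threshold, and tokens longer than the replacement strings. This is only a minimal sketch under stated assumptions: the name `np_quash_safe` is hypothetical, and switching to an object array for the output is a choice made for this illustration, not part of the original answer. It assumes ASCII digit strings short enough to fit in a 64-bit integer.

```python
import numpy as np


def np_quash_safe(tokens, threshold=100):
    """Hypothetical helper: replace digit-only tokens whose value exceeds threshold."""
    if not tokens:
        return []
    v = np.array(tokens, dtype=str)
    # Positions of tokens made up only of digits
    idx = np.flatnonzero(np.char.isdigit(v))
    # Keep only the positions whose numeric value exceeds the threshold
    idx = idx[v[idx].astype(np.int64) > threshold]
    # An object array holds strings of any length, so nothing is truncated
    out = v.astype(object)
    if idx.size:
        out[idx] = ['9999{}'.format(len(s)) for s in v[idx].tolist()]
    return out.tolist()


print(np_quash_safe(['foo', 'bar', '3333']))
```

The object array trades some speed for safety, so it may be slower than the fixed-width version above; it has not been benchmarked here.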
Old answer
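import pandas as pd

s = pd.Series(['foo', 'bar', '3333'])
# Rows whose numeric value exceeds 100 become '9999' + their string length
s.loc[pd.to_numeric(s, errors='coerce') > 100] = s.str.len().map('9999{}'.format)
s
0      foo
1      bar
2    99994
dtype: object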
Or
s.tolist()
['foo', 'bar', '99994']
Answer 1 (score: 1)
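I got a good speedup by counting the digits in the text instead of doing any conversion. In this test program it cut the run time by nearly 80%. The program runs the original code, my text-checking code, and piRSquared's numpy code. May the best code win!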
import time

import numpy as np

# a thousand lists of 99 items each (close to the 100-item lists in the question)
test_data = [['foo', 'bar', '3333'] * 33 for _ in range(1000)]


def quash_large_numbers(tokens, threshold=100):
    def is_int(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    BIG_NUMBER_TOKEN = '9999%d'
    tokens_no_high_nums = [BIG_NUMBER_TOKEN % len(t) if is_int(t) and int(t) > threshold else t
                           for t in tokens]
    return tokens_no_high_nums


start = time.time()
result = [quash_large_numbers(tokens, 100) for tokens in test_data]
print('original', time.time() - start)


def quash(somelist, digits):
    # Replace any all-digit string longer than `digits` characters; no int() conversion
    return [text if len(text) <= digits or not text.isdigit() else '9999' + str(len(text))
            for text in somelist]


start = time.time()
result = [quash(item, 2) for item in test_data]
print('textual ', time.time() - start)


def np_quash(somelist, threshold=100):
    v = np.array(somelist)
    r = np.arange(v.size)
    m = np.char.isdigit(v)  # public alias of np.core.defchararray.isdigit
    g = v[m].astype(int) > threshold
    i = r[m][g]
    t = np.array(['9999{}'.format(len(x)) for x in v[i].tolist()])
    v = v.astype(t.dtype)
    v[i] = t
    return v.tolist()


start = time.time()
result = [np_quash(item, 100) for item in test_data]
print('numpy ', time.time() - start)
Results
original 0.6143333911895752
textual 0.12842845916748047
numpy 0.3644399642944336
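One caveat with the textual check: it is not strictly equivalent to the numeric comparison. With `digits=2` it also replaces `'100'` (which is not greater than 100) and zero-padded strings such as `'0050'`. The following is a minimal sketch of a text-only variant that matches `int(t) > threshold` for plain digit strings. The name `quash_exact` is hypothetical, the sketch assumes ASCII digits and a non-negative integer threshold, and it has not been timed against the versions above.

```python
def quash_exact(tokens, threshold=100):
    """Hypothetical text-only check matching int(t) > threshold for digit strings."""
    limit = str(threshold)
    n = len(limit)
    out = []
    for text in tokens:
        # isascii() (Python 3.7+) excludes non-ASCII digits such as superscripts
        if text.isdigit() and text.isascii():
            core = text.lstrip('0')  # leading zeros do not change the value
            # More digits means a larger number; with equal digit counts,
            # string comparison orders the same way as numeric comparison.
            if len(core) > n or (len(core) == n and core > limit):
                out.append('9999' + str(len(text)))
                continue
        out.append(text)
    return out


print(quash_exact(['foo', '100', '101', '0050', '3333']))
```

Unlike the question's version, this leaves signed or whitespace-padded strings such as `'+500'` or `' 3333'` untouched, since `str.isdigit()` rejects them.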