I have two pd.Series.
Series_A contains strings.
Series_B contains substrings of Series_A, sorted by character length.
I now want to remove from the strings in Series_A the parts listed in Series_B (see the code below).
I would like to use the Dask library to speed this process up, but I don't know how to go about it, in particular whether I should partition Series_A, Series_B, or both.
#input data (simplified)
Series_A = pd.Series(data=["AAAABC","AAABC","AAACBC"]) #real data: 50.000 strings
Series_B = pd.Series(data=["AAAA","ABC","BC"]) #real data: 800.000 strings
#loop
for element in Series_B:
    Series_A = Series_A.map(lambda x: x.replace(element,""))
#expected output
Series_A_output = pd.Series(data=["","AA","AAAC"])
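The length ordering of Series_B matters here: removing the longest substrings first is what produces the expected output. A quick sanity check of the loop above (my own verification snippet, plain pandas), which also shows that reversing the order changes the result:

```python
import pandas as pd

Series_A = pd.Series(["AAAABC", "AAABC", "AAACBC"])
Series_B = pd.Series(["AAAA", "ABC", "BC"])

# Loop as in the question: longest substrings removed first.
out = Series_A.copy()
for element in Series_B:
    out = out.map(lambda x: x.replace(element, ""))
print(list(out))  # ['', 'AA', 'AAAC'], matching Series_A_output

# Shortest-first order gives a different result: removing "BC" first
# destroys the "ABC" occurrence in "AAABC".
out2 = Series_A.copy()
for element in Series_B[::-1]:
    out2 = out2.map(lambda x: x.replace(element, ""))
print(list(out2))  # ['', 'AAA', 'AAAC']
```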
EDIT:
I experimented a bit with the suggestions, and for now the earlier loop/map approach still seems to be the fastest. Am I doing something wrong?
# =============================================================================
# libraries
# =============================================================================
import dask.dataframe as dd
import os
import time
import pandas as pd
# =============================================================================
# prepare experiment
# =============================================================================
s1 = pd.Series(data=["AAAABC","AAABC","AAACBC"]*(100)) #real data: 50.000 strings
s2 = pd.Series(data=["AAAA","ABC","BC"]*(100)) #real data: 800.000 strings
s1 = s1.to_frame()
s1["matched"] = ""
s1["combined"] = list(zip(s1.iloc[:,0], s1["matched"]))
s1_backup = s1.copy()
# =============================================================================
# custom functions
# =============================================================================
def replacer(x):
    k = 0
    l = len(s2)
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x
# =============================================================================
# pandas Legacy
# =============================================================================
s1 = s1_backup.copy()
start = time.time()
for element in s2:
    s1["combined"] = s1["combined"].map(lambda x: (x[0].replace(element, ""),""))
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 0.00min to complete.
#Process analyzed: 300 elements.
#Process took: 0.0007s per element.''')
# =============================================================================
# pandas with new replacer function
# =============================================================================
s1 = s1_backup.copy()
start = time.time()
# note: replacer already loops over all of s2 internally, so this outer loop
# makes every string get scanned against s2 len(s2) times over
for element in s2:
    s1["combined"] = s1["combined"].map(lambda x: (replacer(x[0]),""))
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 4.79min to complete.
#Process analyzed: 300 elements.
#Process took: 0.9585s per element.''')
# =============================================================================
# dask Legacy
# =============================================================================
s1 = s1_backup.copy()
s1 = dd.from_pandas(s1, npartitions=10)
start = time.time()
for element in s2:
    s1["combined"] = s1.map_partitions(lambda x: x["combined"].map(lambda y: (y[0].replace(element, ""),"")))
print(s1["combined"].compute())
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 0.14min to complete.
#Process analyzed: 300 elements.
#Process took: 0.0270s per element.''')
Answer 0 (score: 0)
First, you can optimize your function: currently you keep iterating over the second series even after the string has already been reduced to ''. You may consider a custom function:
import pandas as pd
s1 = pd.Series(data=["AAAABC", "AAABC", "AAACBC"]) #real data: 50.000 strings
s2 = pd.Series(data=["AAAA", "ABC", "BC"])
def replacer(x):
    k = 0
    l = len(s2)
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x
# and use in the following way
s1 = s1.map(replacer)
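On the sample data, this early-exit replacer reproduces the original loop's output; a quick check (my own snippet, not part of the original answer):

```python
import pandas as pd

s1 = pd.Series(["AAAABC", "AAABC", "AAACBC"])
s2 = pd.Series(["AAAA", "ABC", "BC"])

def replacer(x):
    k = 0
    l = len(s2)
    # Stop as soon as the string is fully consumed.
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x

result = s1.map(replacer)
print(list(result))  # ['', 'AA', 'AAAC'], same as the loop/map version
```

Note that the payoff depends on how many strings actually shrink to "" early; a string that never empties is still checked against every element of s2.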
import dask.dataframe as dd
import os
# you should play with the optimal number of partitions
# if this is not a one-off job
npartitions = os.cpu_count()
s1 = dd.from_pandas(s1, npartitions=npartitions)
s1 = s1.map_partitions(lambda x: x.map(replacer)).compute()
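A further alternative worth considering (my suggestion, not from the answer above): compile all substrings into a single alternation regex and strip them in one pass per string, re-applying until a fixed point. With the longest substrings listed first this reproduces the loop's result on the sample data, although the semantics are not identical to the sequential loop in general, and building a pattern from 800,000 substrings has its own cost.

```python
import re
import pandas as pd

s1 = pd.Series(["AAAABC", "AAABC", "AAACBC"])
s2 = pd.Series(["AAAA", "ABC", "BC"])

# Longest substrings first, so they win when alternatives match at the
# same position (s2 is already sorted by length in the question).
pattern = re.compile("|".join(re.escape(sub) for sub in s2))

def strip_all(x):
    # Re-apply the substitution until nothing changes, mimicking the
    # repeated .replace passes of the original loop.
    prev = None
    while x != prev:
        prev, x = x, pattern.sub("", x)
    return x

result = s1.map(strip_all)
print(list(result))  # ['', 'AA', 'AAAC']
```

Separately, dask's default threaded scheduler gains little for pure-Python string work because `str.replace` holds the GIL; `.compute(scheduler="processes")` may help, at the cost of serializing the partitions.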