I have two pd.Series.
Series_A contains strings.
Series_B contains substrings of Series_A, sorted by character length.
I now want to remove from the strings in Series_A the parts listed in Series_B (see the code below).
I would like to use the Dask library to speed this process up, but I don't know how to go about it, in particular whether I should partition Series_A, Series_B, or both.
#input data (simplified)
Series_A = pd.Series(data=["AAAABC","AAABC","AAACBC"]) #real data: 50.000 strings
Series_B = pd.Series(data=["AAAA","ABC","BC"]) #real data: 800.000 strings
#loop
for element in Series_B:
    Series_A = Series_A.map(lambda x: x.replace(element,""))
#expected output
Series_A_output = pd.Series(data=["","AA","AAAC"])
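The length ordering of Series_B matters here: removing the longest substrings first is what produces the expected output. A quick sanity check of the loop above (my own verification snippet, plain pandas), which also shows that reversing the order changes the result:

```python
import pandas as pd

Series_A = pd.Series(["AAAABC", "AAABC", "AAACBC"])
Series_B = pd.Series(["AAAA", "ABC", "BC"])

# Loop as in the question: longest substrings removed first.
out = Series_A.copy()
for element in Series_B:
    out = out.map(lambda x: x.replace(element, ""))
print(list(out))  # ['', 'AA', 'AAAC'], matching Series_A_output

# Shortest-first order gives a different result: removing "BC" first
# destroys the "ABC" occurrence in "AAABC".
out2 = Series_A.copy()
for element in Series_B[::-1]:
    out2 = out2.map(lambda x: x.replace(element, ""))
print(list(out2))  # ['', 'AAA', 'AAAC']
```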
EDIT:
I experimented a bit with the suggestions, and for now the earlier loop/map approach still seems to be the fastest. Am I doing something wrong?
# =============================================================================
# libraries
# =============================================================================
import dask.dataframe as dd
import os
import time
import pandas as pd
# =============================================================================
# prepare experiment
# =============================================================================
s1 = pd.Series(data=["AAAABC","AAABC","AAACBC"]*(100)) #real data: 50.000 strings
s2 = pd.Series(data=["AAAA","ABC","BC"]*(100)) #real data: 800.000 strings
s1 = s1.to_frame()
s1["matched"] = ""
s1["combined"] = list(zip(s1.iloc[:,0], s1["matched"]))
s1_backup = s1.copy()
# =============================================================================
# custom functions
# =============================================================================
def replacer(x):
    k = 0
    l = len(s2)
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x
# =============================================================================
# pandas Legacy
# =============================================================================
s1 = s1_backup.copy()
start = time.time()
for element in s2:
    s1["combined"] = s1["combined"].map(lambda x: (x[0].replace(element, ""),""))
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 0.00min to complete.
#Process analyzed: 300 elements.
#Process took: 0.0007s per element.''')
# =============================================================================
# pandas with new replacer function
# =============================================================================
s1 = s1_backup.copy()
start = time.time()
# note: replacer already loops over all of s2 internally, so this outer loop
# makes every string get scanned against s2 len(s2) times over
for element in s2:
    s1["combined"] = s1["combined"].map(lambda x: (replacer(x[0]),""))
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 4.79min to complete.
#Process analyzed: 300 elements.
#Process took: 0.9585s per element.''')
# =============================================================================
# dask Legacy
# =============================================================================
s1 = s1_backup.copy()
s1 = dd.from_pandas(s1, npartitions=10)
start = time.time()
for element in s2:
    s1["combined"] = s1.map_partitions(lambda x: x["combined"].map(lambda y: (y[0].replace(element, ""),"")))
print(s1["combined"].compute())
end = time.time()
print("Process took: {0:2.2f}min to complete.".format((end-start)/60))
print("Process analyzed: {0:2.0f} elements.".format(len(s1)))
print("Process took: {0:2.4f}s per element.".format((end-start)/len(s1)))
#print('''Process took: 0.14min to complete.
#Process analyzed: 300 elements.
#Process took: 0.0270s per element.''')
Answer 0 (score: 0)
First, you can optimize your function: currently you keep iterating over the second series even after the string has already been reduced to ''. You may consider a custom function:
import pandas as pd
s1 = pd.Series(data=["AAAABC", "AAABC", "AAACBC"]) #real data: 50.000 strings
s2 = pd.Series(data=["AAAA", "ABC", "BC"])
def replacer(x):
    k = 0
    l = len(s2)
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x
# and use in the following way
s1 = s1.map(replacer)
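On the sample data, this early-exit replacer reproduces the original loop's output; a quick check (my own snippet, not part of the original answer):

```python
import pandas as pd

s1 = pd.Series(["AAAABC", "AAABC", "AAACBC"])
s2 = pd.Series(["AAAA", "ABC", "BC"])

def replacer(x):
    k = 0
    l = len(s2)
    # Stop as soon as the string is fully consumed.
    while len(x) > 0 and k < l:
        x = x.replace(s2[k], "")
        k += 1
    return x

result = s1.map(replacer)
print(list(result))  # ['', 'AA', 'AAAC'], same as the loop/map version
```

Note that the payoff depends on how many strings actually shrink to "" early; a string that never empties is still checked against every element of s2.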
import dask.dataframe as dd
import os
# you should play with the optimal number of partitions
# if this is not a one-off job
npartitions = os.cpu_count()
s1 = dd.from_pandas(s1, npartitions=npartitions)
s1 = s1.map_partitions(lambda x: x.map(replacer)).compute()
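A further alternative worth considering (my suggestion, not from the answer above): compile all substrings into a single alternation regex and strip them in one pass per string, re-applying until a fixed point. With the longest substrings listed first this reproduces the loop's result on the sample data, although the semantics are not identical to the sequential loop in general, and building a pattern from 800,000 substrings has its own cost.

```python
import re
import pandas as pd

s1 = pd.Series(["AAAABC", "AAABC", "AAACBC"])
s2 = pd.Series(["AAAA", "ABC", "BC"])

# Longest substrings first, so they win when alternatives match at the
# same position (s2 is already sorted by length in the question).
pattern = re.compile("|".join(re.escape(sub) for sub in s2))

def strip_all(x):
    # Re-apply the substitution until nothing changes, mimicking the
    # repeated .replace passes of the original loop.
    prev = None
    while x != prev:
        prev, x = x, pattern.sub("", x)
    return x

result = s1.map(strip_all)
print(list(result))  # ['', 'AA', 'AAAC']
```

Separately, dask's default threaded scheduler gains little for pure-Python string work because `str.replace` holds the GIL; `.compute(scheduler="processes")` may help, at the cost of serializing the partitions.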