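My problem is a bit tricky. I split a huge data file into chunks and repeat the same fuzzy-matching code for each chunk. Afterwards, I collate the results into a single file. I would like to know whether I can use some kind of loop to reuse the code instead of writing it out for every variable. An example is below.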
import difflib

import pandas as pd

df = pd.read_csv('dec 10.csv')
df1 = df.iloc[0:20000]
df2 = df.iloc[20000:40000]
df3 = df.iloc[40000:60000]
match1 = df1['Customer Name'].map(lambda x: difflib.get_close_matches(x, df1['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match2 = df2['Customer Name'].map(lambda x: difflib.get_close_matches(x, df2['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match3 = df3['Customer Name'].map(lambda x: difflib.get_close_matches(x, df3['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
# DataFrame.append was removed in pandas 2.0; pd.concat collates all three results at once
b = pd.concat([match1, match2, match3], ignore_index=True)
I am looking for a cleaner way to write the matching code once, rather than writing it for every chunk and collating the results afterwards.
Answer 0 (score: 1)
You can loop over a list of the dataframes, so that each iteration only refers to df and the code is not repeated:
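matches = []  # one result per chunk
for df in [df1, df2, df3]:  # note: `df` is rebound to each chunk in turn
    names = df['Customer Name']
    # parentheses let the method chain span several lines
    match_ = (
        names.map(lambda x: difflib.get_close_matches(x, names.values, n=2, cutoff=0.8))
        .apply(pd.Series)
        .dropna(axis=0)
    )
    matches.append(match_)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
match = pd.concat(matches, ignore_index=True)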
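To see what the chained expression in the loop produces, here is a minimal sketch on a hypothetical four-name series. The names are invented for illustration, and `.tolist()` is used in place of `.values` purely as a stylistic assumption.

```python
import difflib

import pandas as pd

# Hypothetical toy data; the real column comes from 'dec 10.csv'.
names = pd.Series(
    ["Acme Corp", "Acme Corp.", "Globex", "Initech"], name="Customer Name"
)
possibilities = names.tolist()

# Each name is compared against every name in the same chunk, itself included,
# so every result list starts with the name itself.
candidates = names.map(
    lambda x: difflib.get_close_matches(x, possibilities, n=2, cutoff=0.8)
)

# Lists of unequal length become rows padded with NaN; dropna then keeps only
# the rows where a second, near-duplicate name was found.
pairs = candidates.apply(pd.Series).dropna(axis=0)
print(pairs)
```

Only the two "Acme" rows survive here, because they are the only names with a close match other than themselves. Note that matching happens within a chunk only, so near-duplicates that fall into different chunks are not paired.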
Answer 1 (score: 1)
First, you can split a dataframe into groups of length n like this:
dfgroups = [df[x:x+n] for x in range(0, len(df), n)]
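With 20000 in place of n, you get chunks of at most 20,000 rows each. You can then loop over each item in dfgroups. You will also want matches to be a list of its own that you can append to. Finally, for readability, with a line this long you may prefer to write a mapper function rather than one large lambda.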
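As a quick check of the chunking expression, here is a small sketch on a hypothetical 7-row frame with a chunk size of 3 (both values are made up to keep the output readable):

```python
import pandas as pd

# Hypothetical stand-in for the real file: 7 rows, chunk size 3.
df = pd.DataFrame({"Customer Name": [f"Customer {i}" for i in range(7)]})
n = 3

# Slicing past the end of the frame is safe, so the last chunk is simply shorter.
dfgroups = [df[x:x + n] for x in range(0, len(df), n)]

print([len(g) for g in dfgroups])  # prints [3, 3, 1]
```

Each chunk keeps the row labels of the original frame, so the last chunk here still carries index 6.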
Putting it all together, the code can be rewritten like this.
import difflib

import pandas as pd

df = pd.read_csv('dec 10.csv')
# split df into groups of 20,000
dfgroups = [df[x:x+20000] for x in range(0, len(df), 20000)]
matches = []  # empty list to store matches
for dfgroup in dfgroups:
    values = dfgroup['Customer Name'].values

    # a function to replace that long line, more readable
    # it is redefined on every loop, so it uses the current chunk's `values`
    # this is optional; you can instead keep the long line, replacing `df` with `dfgroup`
    def mapper(x):
        # returns a list of up to 2 names from this chunk that are close to x
        return difflib.get_close_matches(x, values, n=2, cutoff=0.8)

    # passing the function as an argument rather than using a lambda
    match = dfgroup['Customer Name'].map(mapper)
    # expand each list into columns and drop rows that have no second match
    match = match.apply(pd.Series).dropna(axis=0)
    matches.append(match)  # append it to the matches list
matches is now equivalent to [match1, match2, match3, ...] and can be used as matches[0], matches[1], and so on.
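The question also asks for the results to be collated into one file. A minimal sketch of that last step follows, assuming each element of matches is a DataFrame of matched pairs; the two small frames below are hypothetical placeholders for the real per-chunk results.

```python
import pandas as pd

# Hypothetical per-chunk results standing in for the real `matches` list.
matches = [
    pd.DataFrame({0: ["Acme Corp"], 1: ["Acme Corp."]}),
    pd.DataFrame({0: ["Globex Inc"], 1: ["Globex Inc."]}),
]

# A single concat call takes the place of the chained DataFrame.append calls
# in the question (DataFrame.append was removed in pandas 2.0).
combined = pd.concat(matches, ignore_index=True)
print(combined)
```

From there, something like `combined.to_csv('matches.csv', index=False)` would write the collated result to disk; the file name is an assumption, not one taken from the question.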